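I am trying to install Ansible AWX on my Kubernetes cluster, but the RabbitMQ container fails with a "Failed to get nodes from k8s" error.
Here are the versions of the platforms I am using: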
[node1 ~]# kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.5",
GitCommit:"f01a2bf98249a4db383560443a59bed0c13575df", GitTreeState:"clean",
BuildDate:"2018-03-19T15:50:45Z", GoVersion:"go1.9.3", Compiler:"gc",
Platform:"linux/amd64"}
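Kubernetes was deployed with the kubespray playbook v2.5.0, and all services and pods are up and running (CoreDNS, Weave, iptables).
I am deploying AWX 1.0.6 using the 1.0.6 images for awx_web and awx_task.
I am using an external PostgreSQL database at v10.4 and have verified that the tables are being created by awx in the db.
Troubleshooting steps I have tried:
Cluster information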
[node1 ~]# kubectl get all -n awx
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deploy/awx 1 1 1 0 38m
NAME DESIRED CURRENT READY AGE
rs/awx-654f7fc84c 1 1 0 38m
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deploy/awx 1 1 1 0 38m
NAME DESIRED CURRENT READY AGE
rs/awx-654f7fc84c 1 1 0 38m
NAME READY STATUS RESTARTS AGE
po/awx-654f7fc84c-9ppqb 3/4 CrashLoopBackOff 11 38m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/awx-rmq-mgmt ClusterIP 10.233.10.146 <none> 15672/TCP 1d
svc/awx-web-svc NodePort 10.233.3.75 <none> 80:31700/TCP 1d
svc/rabbitmq NodePort 10.233.37.33 <none> 15672:30434/TCP,5672:31962/TCP 1d
AWX RabbitMQ error log
[node1 ~]# kubectl logs -n awx awx-654f7fc84c-9ppqb awx-rabbit
2018-07-09 14:47:37.464 [info] <0.33.0> Application lager started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.767 [info] <0.33.0> Application os_mon started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.767 [info] <0.33.0> Application crypto started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.768 [info] <0.33.0> Application cowlib started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.768 [info] <0.33.0> Application xmerl started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.851 [info] <0.33.0> Application mnesia started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.851 [info] <0.33.0> Application recon started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application jsx started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application asn1 started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application public_key started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.897 [info] <0.33.0> Application ssl started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application ranch started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application ranch_proxy_protocol started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application rabbit_common started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.907 [info] <0.33.0> Application amqp_client started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.909 [info] <0.33.0> Application cowboy started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.957 [info] <0.33.0> Application inets started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.964 [info] <0.193.0>
Starting RabbitMQ 3.7.4 on Erlang 20.1.7
Copyright (C) 2007-2018 Pivotal Software, Inc.
Licensed under the MPL. See http://www.rabbitmq.com/
## ##
## ## RabbitMQ 3.7.4. Copyright (C) 2007-2018 Pivotal Software, Inc.
########## Licensed under the MPL. See http://www.rabbitmq.com/
###### ##
########## Logs: <stdout>
Starting broker...
2018-07-09 14:47:37.982 [info] <0.193.0>
node : rabbit@10.233.120.5
home dir : /var/lib/rabbitmq
config file(s) : /etc/rabbitmq/rabbitmq.conf
cookie hash : at619UOZzsenF44tSK3ulA==
log(s) : <stdout>
database dir : /var/lib/rabbitmq/mnesia/rabbit@10.233.120.5
2018-07-09 14:47:39.649 [info] <0.201.0> Memory high watermark set to 11998 MiB (12581714329 bytes) of 29997 MiB (31454285824 bytes) total
2018-07-09 14:47:39.652 [info] <0.203.0> Enabling free disk space monitoring
2018-07-09 14:47:39.653 [info] <0.203.0> Disk free limit set to 50MB
2018-07-09 14:47:39.658 [info] <0.205.0> Limiting to approx 1048476 file handles (943626 sockets)
2018-07-09 14:47:39.658 [info] <0.206.0> FHC read buffering: OFF
2018-07-09 14:47:39.658 [info] <0.206.0> FHC write buffering: ON
2018-07-09 14:47:39.660 [info] <0.193.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@10.233.120.5 is empty. Assuming we need to join an existing cluster or initialise from scratch...
2018-07-09 14:47:39.660 [info] <0.193.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2018-07-09 14:47:39.660 [info] <0.193.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2018-07-09 14:47:39.660 [info] <0.193.0> Peer discovery backend does not support locking, falling back to randomized delay
2018-07-09 14:47:39.660 [info] <0.193.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2018-07-09 14:47:39.665 [info] <0.193.0> Failed to get nodes from k8s - {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},
{inet,[inet],nxdomain}]}
2018-07-09 14:47:39.665 [error] <0.192.0> CRASH REPORT Process <0.192.0> with 0 neighbours exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164 in application_master:init/4 line 134
2018-07-09 14:47:39.666 [info] <0.33.0> Application rabbit exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,\"{failed_connect,[{to_address,{\\"kubernetes.default.svc.cluster.local\\",443}},\n {inet,[inet],nxdomain}]}\"}},[{rabbit_mnesia,init_from_config,0,[{file,\"src/rabbit_mnesia.erl\"},{line,164}]},{rabbit_mnesia,init_with_lock,3,[{file,\"src/rabbit_mnesia.erl\"},{line,144}]},{rabbit_mnesia,init,0,[{file,\"src/rabbit_mnesia.erl\"},{line,111}]},{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,run_step,2,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,'-run_boot_steps/1-lc$^0/1-0-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit_boot_steps,run_boot_steps,1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit,start,2,[{file,\"src/rabbit.erl\"},{line,793}]}]}}}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,"{failed_connect,[{to_address,{\"kubernetes.defau
Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done
Kubernetes API service
[node1 ~]# kubectl describe service kubernetes
Name: kubernetes
Namespace: default
Labels: component=apiserver
provider=kubernetes
Annotations: <none>
Selector: <none>
Type: ClusterIP
IP: 10.233.0.1
Port: https 443/TCP
TargetPort: 6443/TCP
Endpoints: 10.237.34.19:6443,10.237.34.21:6443
Session Affinity: ClientIP
Events: <none>
nslookup from a busybox pod in the same Kubernetes cluster
[node2 ~]# kubectl exec -it busybox -- sh
/ # nslookup kubernetes.default.svc.cluster.local
Server: 10.233.0.3
Address 1: 10.233.0.3 coredns.kube-system.svc.cluster.local
Name: kubernetes.default.svc.cluster.local
Address 1: 10.233.0.1 kubernetes.default.svc.cluster.local
Please let me know if I have left out any information that would help with troubleshooting.
Answer 0 (score: 0)
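I believe the solution is to omit the explicit kubernetes host. I can't think of any good reason why one would need to specify the Kubernetes API host from inside the cluster.
If, for some terrible reason, the RMQ plugin requires it, then try swapping in the Service IP (assuming the master's SSL certificate has its Service IP in the SAN list).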
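Before swapping in the Service IP, it may be worth confirming the SAN assumption. The following is a minimal sketch, not a definitive check: `san_has_ip` is a hypothetical helper name, and it assumes `openssl` is available on a machine that can reach the Service IP (10.233.0.1 in the `kubectl describe service kubernetes` output above).

```shell
#!/bin/sh
# san_has_ip (hypothetical helper): reads the text form of a certificate on
# stdin and succeeds if the given IP is listed as a Subject Alternative Name.
san_has_ip() {
  # -A1 keeps the line after the SAN header, where openssl prints the entries.
  # -F matches the IP literally; -w prevents 10.233.0.1 from matching 10.233.0.10.
  grep -A1 'Subject Alternative Name' | grep -qwF "IP Address:$1"
}

# Possible usage against the API server's Service IP (needs network access):
#   openssl s_client -connect 10.233.0.1:443 </dev/null 2>/dev/null \
#     | openssl x509 -noout -text \
#     | san_has_ip 10.233.0.1 && echo "Service IP is in the SAN list"
```

If the helper reports that the IP is missing, TLS verification against the Service IP would be expected to fail, so swapping the host alone would probably not be enough.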
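As for why it is doing such a silly thing, the only good reason I can think of is that the RMQ PodSpec has somehow ended up with a dnsPolicy other than ClusterFirst. If you truly wish to troubleshoot the RMQ Pod, you can provide an explicit command: to run some debugging bash commands that interrogate the state of the container at launch, and then exec /launch.sh to resume booting up RMQ (as they do).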
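A rough sketch of what such debugging commands could look like is below. The function name `dump_dns_state` is made up for illustration, and the sketch assumes the RabbitMQ image ships `getent` (not verified here); only `/launch.sh` and the hostname come from the thread.

```shell
#!/bin/sh
# The pod's effective DNS policy can be read from outside the container with:
#   kubectl get pod -n awx awx-654f7fc84c-9ppqb -o jsonpath='{.spec.dnsPolicy}'

# dump_dns_state (hypothetical helper): prints the resolver configuration the
# container actually received, then tries to resolve the given hostname.
dump_dns_state() {
  host="$1"
  echo "--- /etc/resolv.conf ---"
  cat /etc/resolv.conf 2>/dev/null || echo "(unreadable)"
  echo "--- lookup: $host ---"
  if getent hosts "$host"; then
    echo "lookup OK"
  else
    echo "lookup FAILED"
  fi
}

# Inside the container's command: override, this would be followed by:
#   dump_dns_state kubernetes.default.svc.cluster.local
#   exec /launch.sh
```

If the nameserver printed from /etc/resolv.conf is not the cluster DNS address (10.233.0.3 in the busybox test above), that would point toward the dnsPolicy explanation.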