Ansible AWX RabbitMQ container in Kubernetes fails to get nodes from k8s with nxdomain

I am trying to install Ansible AWX on my Kubernetes cluster, but the RabbitMQ container keeps throwing a "Failed to get nodes from k8s" error.

Below are the versions of the platforms I am using:

[node1 ~]# kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.5", 
GitCommit:"f01a2bf98249a4db383560443a59bed0c13575df", GitTreeState:"clean", 
BuildDate:"2018-03-19T15:50:45Z", GoVersion:"go1.9.3", Compiler:"gc", 
Platform:"linux/amd64"}

Kubernetes was deployed via the kubespray playbook v2.5.0, and all services and pods are up and running (CoreDNS, Weave, IPtables).

I am deploying AWX 1.0.6, using the 1.0.6 images for awx_web and awx_task.

I am using an external PostgreSQL database at v10.4, and I have verified that AWX is creating its tables in the database.
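
For example, a quick table listing against the external database is enough to confirm this (the host and credentials below are placeholders, not values from my setup):

[node1 ~]# psql -h <postgres-host> -p 5432 -U awx -d awx -c '\dt'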

Troubleshooting steps I have tried:

  • I deployed AWX 1.0.5 with its etcd pod to the same cluster, and it worked as expected.

  • I deployed a standalone RabbitMQ cluster in the same k8s cluster, mimicking the AWX rabbit deployment as closely as possible, and it works with the rabbit_peer_discovery_k8s backend (see the config sketch after this list).

  • I have tried feeding a few rabbitmq.conf variants to AWX 1.0.6, with no luck; it just kept failing with the same error.

  • I have verified that there is a kubernetes.default.svc.cluster.local entry in the /etc/resolv.conf file.
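
For reference, a minimal rabbitmq.conf for this backend looks roughly like the sketch below; the service_name and address_type values are illustrative assumptions, not taken from the AWX image.

# Cluster peers are discovered through the Kubernetes API
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
# API endpoint to query; this hostname is also the plugin's default
cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
cluster_formation.k8s.port = 443
cluster_formation.k8s.scheme = https
# Headless service fronting the RabbitMQ pods (illustrative name)
cluster_formation.k8s.service_name = rabbitmq
# Register/resolve peers by pod IP rather than hostname
cluster_formation.k8s.address_type = ip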

Cluster Info

[node1 ~]# kubectl get all -n awx
NAME         DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/awx   1         1         1            0           38m

NAME                DESIRED   CURRENT   READY     AGE
rs/awx-654f7fc84c   1         1         0         38m

NAME                      READY     STATUS             RESTARTS   AGE
po/awx-654f7fc84c-9ppqb   3/4       CrashLoopBackOff   11         38m

NAME               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                          AGE
svc/awx-rmq-mgmt   ClusterIP   10.233.10.146   <none>        15672/TCP                        1d
svc/awx-web-svc    NodePort    10.233.3.75     <none>        80:31700/TCP                     1d
svc/rabbitmq       NodePort    10.233.37.33    <none>        15672:30434/TCP,5672:31962/TCP   1d

AWX RabbitMQ error log

[node1 ~]# kubectl logs -n awx awx-654f7fc84c-9ppqb awx-rabbit
2018-07-09 14:47:37.464 [info] <0.33.0> Application lager started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.767 [info] <0.33.0> Application os_mon started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.767 [info] <0.33.0> Application crypto started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.768 [info] <0.33.0> Application cowlib started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.768 [info] <0.33.0> Application xmerl started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.851 [info] <0.33.0> Application mnesia started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.851 [info] <0.33.0> Application recon started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application jsx started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application asn1 started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application public_key started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.897 [info] <0.33.0> Application ssl started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application ranch started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application ranch_proxy_protocol started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application rabbit_common started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.907 [info] <0.33.0> Application amqp_client started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.909 [info] <0.33.0> Application cowboy started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.957 [info] <0.33.0> Application inets started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.964 [info] <0.193.0>
 Starting RabbitMQ 3.7.4 on Erlang 20.1.7
 Copyright (C) 2007-2018 Pivotal Software, Inc.
 Licensed under the MPL.  See http://www.rabbitmq.com/

  ##  ##
  ##  ##      RabbitMQ 3.7.4. Copyright (C) 2007-2018 Pivotal Software, Inc.
  ##########  Licensed under the MPL.  See http://www.rabbitmq.com/
  ######  ##
  ##########  Logs: <stdout>

              Starting broker...
2018-07-09 14:47:37.982 [info] <0.193.0>
 node           : rabbit@10.233.120.5
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.conf
 cookie hash    : at619UOZzsenF44tSK3ulA==
 log(s)         : <stdout>
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@10.233.120.5
2018-07-09 14:47:39.649 [info] <0.201.0> Memory high watermark set to 11998 MiB (12581714329 bytes) of 29997 MiB (31454285824 bytes) total
2018-07-09 14:47:39.652 [info] <0.203.0> Enabling free disk space monitoring
2018-07-09 14:47:39.653 [info] <0.203.0> Disk free limit set to 50MB
2018-07-09 14:47:39.658 [info] <0.205.0> Limiting to approx 1048476 file handles (943626 sockets)
2018-07-09 14:47:39.658 [info] <0.206.0> FHC read buffering:  OFF
2018-07-09 14:47:39.658 [info] <0.206.0> FHC write buffering: ON
2018-07-09 14:47:39.660 [info] <0.193.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@10.233.120.5 is empty. Assuming we need to join an existing cluster or initialise from scratch...
2018-07-09 14:47:39.660 [info] <0.193.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2018-07-09 14:47:39.660 [info] <0.193.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2018-07-09 14:47:39.660 [info] <0.193.0> Peer discovery backend does not support locking, falling back to randomized delay
2018-07-09 14:47:39.660 [info] <0.193.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2018-07-09 14:47:39.665 [info] <0.193.0> Failed to get nodes from k8s - {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},
                 {inet,[inet],nxdomain}]}
2018-07-09 14:47:39.665 [error] <0.192.0> CRASH REPORT Process <0.192.0> with 0 neighbours exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n                 {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164 in application_master:init/4 line 134
2018-07-09 14:47:39.666 [info] <0.33.0> Application rabbit exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n                 {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,\"{failed_connect,[{to_address,{\\"kubernetes.default.svc.cluster.local\\",443}},\n                 {inet,[inet],nxdomain}]}\"}},[{rabbit_mnesia,init_from_config,0,[{file,\"src/rabbit_mnesia.erl\"},{line,164}]},{rabbit_mnesia,init_with_lock,3,[{file,\"src/rabbit_mnesia.erl\"},{line,144}]},{rabbit_mnesia,init,0,[{file,\"src/rabbit_mnesia.erl\"},{line,111}]},{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,run_step,2,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,'-run_boot_steps/1-lc$^0/1-0-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit_boot_steps,run_boot_steps,1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit,start,2,[{file,\"src/rabbit.erl\"},{line,793}]}]}}}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,"{failed_connect,[{to_address,{\"kubernetes.defau

Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done

Kubernetes API service

[node1 ~]# kubectl describe service kubernetes
Name:              kubernetes
Namespace:         default
Labels:            component=apiserver
                   provider=kubernetes
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP:                10.233.0.1
Port:              https  443/TCP
TargetPort:        6443/TCP
Endpoints:         10.237.34.19:6443,10.237.34.21:6443
Session Affinity:  ClientIP
Events:            <none>

nslookup from a busybox pod in the same Kubernetes cluster

[node2 ~]# kubectl exec -it busybox -- sh
/ # nslookup  kubernetes.default.svc.cluster.local
Server:    10.233.0.3
Address 1: 10.233.0.3 coredns.kube-system.svc.cluster.local

Name:      kubernetes.default.svc.cluster.local
Address 1: 10.233.0.1 kubernetes.default.svc.cluster.local

Please let me know if I am missing any information that would help with troubleshooting.

1 Answer

I believe the fix is to omit the explicit kubernetes host. I can't think of a good reason one would need to specify the kubernetes API host from inside the cluster.

If for some terrible reason the RMQ plugin requires it, then try swapping in the Service IP (assuming your master's SSL certificate has the Service IP in its list of SANs).
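
Given the kubectl describe output above, the kubernetes Service's ClusterIP is 10.233.0.1, so that swap would look something like this in rabbitmq.conf (a sketch, assuming the AWX image reads a mounted rabbitmq.conf):

# Point peer discovery at the Service IP instead of the DNS name
# that is currently failing with nxdomain
cluster_formation.k8s.host = 10.233.0.1
cluster_formation.k8s.port = 443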


As for why it would do such a silly thing, the only good reason I can think of is that the RMQ PodSpec somehow ended up with a dnsPolicy other than ClusterFirst. If you really want to troubleshoot the RMQ Pod, you can start by providing an explicit command: that runs some debugging bash commands to interrogate the container's state on startup, and then exec /launch.sh to resume launching RMQ (as they do
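
To make that concrete, here is a sketch of both suggestions in PodSpec form. The container name comes from the question's kubectl logs command, and /launch.sh is the launcher mentioned above; the image tag is an assumption, and the debug commands are only examples.

apiVersion: v1
kind: Pod
metadata:
  name: awx-rabbit-debug
  namespace: awx
spec:
  dnsPolicy: ClusterFirst              # the default; make sure nothing overrides it
  containers:
  - name: awx-rabbit
    image: ansible/awx_rabbitmq:3.7.4  # assumed image name/tag
    command: ["/bin/bash", "-c"]
    args:
    - |
      # Dump the container's DNS state before starting RabbitMQ
      cat /etc/resolv.conf
      getent hosts kubernetes.default.svc.cluster.local || true
      # Hand control back to the image's normal launcher
      exec /launch.sh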
