我正在尝试在Kubernetes集群中将节点的高可用性RabbitMQ集群设置为StatefulSet,以便即使同时重启所有节点后,我的数据(例如队列,消息)也仍然存在。由于我是在Kubernetes中部署RabbitMQ节点的,因此我了解到我需要包括一个外部持久卷,以供节点存储数据,以便数据在重新启动后仍然存在。我已经在目录/var/lib/rabbitmq/mnesia
上将Azure文件共享作为卷安装到了我的容器中。
以新的(空)卷开始时,节点启动时没有任何问题,并成功地形成了集群。我可以打开RabbitMQ管理UI,并看到我创建的任何队列都按预期在所有节点上进行了镜像,并且只要有至少1个活动节点,该队列(以及其中的所有消息)将持续存在。使用kubectl delete pod rabbitmq-0 -n rabbit
删除Pod将导致该节点停止然后重新启动,并且日志显示该节点已与任何剩余/活动节点成功同步,因此一切正常。
我遇到的问题是,当我同时删除集群中的所有RabbitMQ节点时,要启动的第一个节点将具有该卷中的持久数据,并尝试与其他两个节点重新集群当然,不活跃。我期望发生的事情是该节点将启动,加载队列和消息数据,然后形成一个新集群(因为它应该注意到没有其他节点处于活动状态)。
我怀疑装入的卷中可能有一些数据指示其他节点的存在,这就是为什么它试图与它们连接并加入所谓的群集的原因,但是我还没有找到一种方法来防止这种情况,并且不确定这是原因。
有两种不同的错误消息:一种是RabbitMQ节点处于崩溃循环时,在容器描述(kubectl describe pod rabbitmq-0 -n rabbit
)中,另一种是在容器日志中。广告连播描述错误输出包括以下内容:
exited with 137:
20:38:12.331 [error] Cookie file /var/lib/rabbitmq/.erlang.cookie must be accessible by owner only
Error: unable to perform an operation on node 'rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local'. Please see diagnostics information and suggestions below.
Most common reasons for this are:
* Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
* CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
* Target node is not running
In addition to the diagnostics info below:
* See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
* Consult server logs on node rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local
* If target node is configured to use long node names, don't forget to use --longnames with CLI tools
DIAGNOSTICS
===========
attempted to contact: ['rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local']
rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local:
* connected to epmd (port 4369) on rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local
* epmd reports: node 'rabbit' not running at all
no other nodes on rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local
* suggestion: start the node
Current node details:
* node name: 'rabbitmqcli-345-rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local'
* effective user's home directory: /var/lib/rabbitmq
* Erlang cookie hash: xxxxxxxxxxxxxxxxx
,日志输出以下信息:
Config file(s): /etc/rabbitmq/rabbitmq.conf
Starting broker...2020-06-12 20:39:08.678 [info] <0.294.0>
node : rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local
home dir : /var/lib/rabbitmq
config file(s) : /etc/rabbitmq/rabbitmq.conf
cookie hash : xxxxxxxxxxxxxxxxx
log(s) : <stdout>
database dir : /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local
...
2020-06-12 20:48:39.015 [warning] <0.294.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitmq-2.rabbitmq-internal.rabbit.svc.cluster.local','rabbit@rabbitmq-1.rabbitmq-internal.rabbit.svc.cluster.local','rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local'],[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-06-12 20:48:39.015 [info] <0.294.0> Waiting for Mnesia tables for 30000 ms, 0 retries left
2020-06-12 20:49:09.341 [info] <0.44.0> Application mnesia exited with reason: stopped
2020-06-12 20:49:09.505 [error] <0.294.0>
2020-06-12 20:49:09.505 [error] <0.294.0> BOOT FAILED
2020-06-12 20:49:09.505 [error] <0.294.0> ===========
2020-06-12 20:49:09.505 [error] <0.294.0> Timeout contacting cluster nodes: ['rabbit@rabbitmq-2.rabbitmq-internal.rabbit.svc.cluster.local',
2020-06-12 20:49:09.505 [error] <0.294.0> 'rabbit@rabbitmq-1.rabbitmq-internal.rabbit.svc.cluster.local'].
...
BACKGROUND
==========
This cluster node was shut down while other nodes were still running.
2020-06-12 20:49:09.506 [error] <0.294.0>
2020-06-12 20:49:09.506 [error] <0.294.0> This cluster node was shut down while other nodes were still running.
2020-06-12 20:49:09.506 [error] <0.294.0> To avoid losing data, you should start the other nodes first, then
2020-06-12 20:49:09.506 [error] <0.294.0> start this one. To force this node to start, first invoke
To avoid losing data, you should start the other nodes first, then
start this one. To force this node to start, first invoke
"rabbitmqctl force_boot". If you do so, any changes made on other
cluster nodes after this one was shut down may be lost.
到目前为止,我一直在尝试清除/var/lib/rabbitmq/mnesia/rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local/nodes_running_at_shutdown
文件的内容,并摆弄诸如卷装载目录和erlang cookie权限之类的配置设置。
以下是相关的部署文件和配置文件:
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: rabbitmq
namespace: rabbit
spec:
serviceName: rabbitmq-internal
revisionHistoryLimit: 3
updateStrategy:
type: RollingUpdate
replicas: 3
selector:
matchLabels:
app: rabbitmq
template:
metadata:
name: rabbitmq
labels:
app: rabbitmq
spec:
serviceAccountName: rabbitmq
terminationGracePeriodSeconds: 10
containers:
- name: rabbitmq
image: rabbitmq:0.13
lifecycle:
postStart:
exec:
command:
- /bin/sh
- -c
- >
until rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} node_health_check; do sleep 1; done;
rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} set_policy ha-all "" '{"ha-mode":"all", "ha-sync-mode": "automatic"}'
ports:
- containerPort: 4369
- containerPort: 5672
- containerPort: 5671
- containerPort: 25672
- containerPort: 15672
resources:
requests:
memory: "500Mi"
cpu: "0.4"
limits:
memory: "600Mi"
cpu: "0.6"
livenessProbe:
exec:
# Stage 2 check:
command: ["rabbitmq-diagnostics", "status", "--erlang-cookie", "$(RABBITMQ_ERLANG_COOKIE)"]
initialDelaySeconds: 60
periodSeconds: 60
timeoutSeconds: 15
readinessProbe:
exec:
# Stage 2 check:
command: ["rabbitmq-diagnostics", "status", "--erlang-cookie", "$(RABBITMQ_ERLANG_COOKIE)"]
initialDelaySeconds: 20
periodSeconds: 60
timeoutSeconds: 10
envFrom:
- configMapRef:
name: rabbitmq-cfg
env:
- name: HOSTNAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: RABBITMQ_USE_LONGNAME
value: "true"
- name: RABBITMQ_NODENAME
value: "rabbit@$(HOSTNAME).rabbitmq-internal.$(NAMESPACE).svc.cluster.local"
- name: K8S_SERVICE_NAME
value: "rabbitmq-internal"
- name: RABBITMQ_DEFAULT_USER
value: user
- name: RABBITMQ_DEFAULT_PASS
value: pass
- name: RABBITMQ_ERLANG_COOKIE
value: my-cookie
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
volumeMounts:
- name: my-volume-mount
mountPath: "/var/lib/rabbitmq/mnesia"
imagePullSecrets:
- name: my-secret
volumes:
- name: my-volume-mount
azureFile:
secretName: azure-rabbitmq-secret
shareName: my-fileshare-name
readOnly: false
---
apiVersion: v1
kind: ConfigMap
metadata:
name: rabbitmq-cfg
namespace: rabbit
data:
RABBITMQ_VM_MEMORY_HIGH_WATERMARK: "0.6"
---
kind: Service
apiVersion: v1
metadata:
namespace: rabbit
name: rabbitmq-internal
labels:
app: rabbitmq
spec:
clusterIP: None
ports:
- name: http
protocol: TCP
port: 15672
- name: amqp
protocol: TCP
port: 5672
- name: amqps
protocol: TCP
port: 5671
selector:
app: rabbitmq
---
kind: Service
apiVersion: v1
metadata:
namespace: rabbit
name: rabbitmq
labels:
app: rabbitmq
type: LoadBalancer
spec:
selector:
app: rabbitmq
ports:
- name: http
protocol: TCP
port: 15672
targetPort: 15672
- name: amqp
protocol: TCP
port: 5672
targetPort: 5672
- name: amqps
protocol: TCP
port: 5671
targetPort: 5671
Dockerfile:
FROM rabbitmq:3.8.4
COPY conf/rabbitmq.conf /etc/rabbitmq
COPY conf/enabled_plugins /etc/rabbitmq
USER root
COPY conf/.erlang.cookie /var/lib/rabbitmq
RUN /bin/bash -c 'ls -ld /var/lib/rabbitmq/.erlang.cookie; chmod 600 /var/lib/rabbitmq/.erlang.cookie; ls -ld /var/lib/rabbitmq/.erlang.cookie'
rabbitmq.conf
## cluster formation settings
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
cluster_formation.k8s.address_type = hostname
cluster_formation.k8s.service_name = rabbitmq-internal
cluster_formation.k8s.hostname_suffix = .rabbitmq-internal.rabbit.svc.cluster.local
cluster_formation.node_cleanup.interval = 60
cluster_formation.node_cleanup.only_log_warning = true
cluster_partition_handling = autoheal
queue_master_locator=min-masters
## general settings
log.file.level = debug
## Mgmt UI secure/non-secure connection settings (secure not implemented yet)
management.tcp.port = 15672
## RabbitMQ entrypoint settings (will be injected below when image is built)
谢谢!