How do I configure a RabbitMQ cluster in Kubernetes with a mounted persistent volume so that data is preserved across a full cluster restart?

Date: 2020-06-13 04:18:20

Tags: docker kubernetes rabbitmq persistent-volumes

I am trying to set up a highly available cluster of RabbitMQ nodes as a StatefulSet in my Kubernetes cluster, so that my data (e.g. queues and messages) persists even after all of the nodes are restarted at the same time. Since I am deploying the RabbitMQ nodes in Kubernetes, I understand that I need to include an external persistent volume for the nodes to store their data in, so that the data survives a restart. I have mounted an Azure file share into my containers as a volume at the directory /var/lib/rabbitmq/mnesia.
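
For reference, the relevant bit of the StatefulSet (included in full further down) is the volume definition that puts the Azure file share over RabbitMQ's data directory on every replica:

volumeMounts:
- name: my-volume-mount
  mountPath: "/var/lib/rabbitmq/mnesia"
...
volumes:
- name: my-volume-mount
  azureFile:
    secretName: azure-rabbitmq-secret
    shareName: my-fileshare-name
    readOnly: false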

When starting with a fresh (empty) volume, the nodes start up without any problems and successfully form a cluster. I can open the RabbitMQ management UI and see that any queue I create is mirrored across all of the nodes as expected, and the queue (along with any messages in it) persists as long as at least one node remains up. Deleting a pod with kubectl delete pod rabbitmq-0 -n rabbit causes that node to stop and then restart, and the logs show that it successfully syncs up with the remaining/running nodes, so everything works as expected.

The problem I am running into is that when I delete all of the RabbitMQ nodes in the cluster at the same time, the first node to come back up has the persisted data from the volume and tries to re-cluster with the other two nodes, which of course are not up. What I expected to happen is that the node would start, load the queue and message data, and then form a new cluster (since it should notice that no other nodes are active).
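
To be concrete, by "delete all of the RabbitMQ nodes at the same time" I mean something along these lines (the exact commands are just for illustration):

# delete every RabbitMQ pod at once (they all carry the app=rabbitmq label)
kubectl delete pod -n rabbit -l app=rabbitmq

# or, equivalently, scale the StatefulSet down to zero and back up
kubectl scale statefulset rabbitmq -n rabbit --replicas=0
kubectl scale statefulset rabbitmq -n rabbit --replicas=3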

I suspect there may be some data in the mounted volume that records the existence of the other nodes, and that this is why it tries to contact them and rejoin the supposed cluster, but I have not found a way to prevent this and am not even sure that it is the cause.

There are two different error messages: one in the pod description (kubectl describe pod rabbitmq-0 -n rabbit) while the RabbitMQ node is in a crash loop, and another in the pod logs. The pod description error output includes the following:

exited with 137: 20:38:12.331 [error] Cookie file /var/lib/rabbitmq/.erlang.cookie must be accessible by owner only

Error: unable to perform an operation on node 'rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local'. Please see diagnostics information and suggestions below.

Most common reasons for this are:

 * Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
 * CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
 * Target node is not running

In addition to the diagnostics info below:

 * See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
 * Consult server logs on node rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local
 * If target node is configured to use long node names, don't forget to use --longnames with CLI tools

DIAGNOSTICS
===========

attempted to contact: ['rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local']

rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local:
  * connected to epmd (port 4369) on rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local
  * epmd reports: node 'rabbit' not running at all
                  no other nodes on rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local
  * suggestion: start the node

Current node details:
 * node name: 'rabbitmqcli-345-rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local'
 * effective user's home directory: /var/lib/rabbitmq
 * Erlang cookie hash: xxxxxxxxxxxxxxxxx

The pod logs, in turn, contain the following:

Config file(s): /etc/rabbitmq/rabbitmq.conf

  Starting broker...2020-06-12 20:39:08.678 [info] <0.294.0> 
 node           : rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.conf
 cookie hash    : xxxxxxxxxxxxxxxxx
 log(s)         : <stdout>
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local

...

2020-06-12 20:48:39.015 [warning] <0.294.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitmq-2.rabbitmq-internal.rabbit.svc.cluster.local','rabbit@rabbitmq-1.rabbitmq-internal.rabbit.svc.cluster.local','rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local'],[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-06-12 20:48:39.015 [info] <0.294.0> Waiting for Mnesia tables for 30000 ms, 0 retries left
2020-06-12 20:49:09.341 [info] <0.44.0> Application mnesia exited with reason: stopped
2020-06-12 20:49:09.505 [error] <0.294.0> 
2020-06-12 20:49:09.505 [error] <0.294.0> BOOT FAILED
2020-06-12 20:49:09.505 [error] <0.294.0> ===========
2020-06-12 20:49:09.505 [error] <0.294.0> Timeout contacting cluster nodes: ['rabbit@rabbitmq-2.rabbitmq-internal.rabbit.svc.cluster.local',
2020-06-12 20:49:09.505 [error] <0.294.0>                                    'rabbit@rabbitmq-1.rabbitmq-internal.rabbit.svc.cluster.local'].

...

BACKGROUND
==========

This cluster node was shut down while other nodes were still running.
To avoid losing data, you should start the other nodes first, then
start this one. To force this node to start, first invoke
"rabbitmqctl force_boot". If you do so, any changes made on other
cluster nodes after this one was shut down may be lost.
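
For context, the rabbitmqctl force_boot suggested at the end of that message would be run against a pod roughly like this (shown only for completeness):

# tell the first node to boot without waiting for its old peers
kubectl exec -n rabbit rabbitmq-0 -- rabbitmqctl force_boot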

So far I have tried clearing out the contents of the /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local/nodes_running_at_shutdown file and fiddling with configuration settings such as the volume mount directory and the Erlang cookie permissions.
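
For example, checking the node's data directory on the share and the cookie permissions from inside a pod looks roughly like this:

# list the node's Mnesia directory on the mounted Azure file share
kubectl exec -n rabbit rabbitmq-0 -- ls -la /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local

# confirm the Erlang cookie is accessible by its owner only (the Dockerfile below chmods it to 600)
kubectl exec -n rabbit rabbitmq-0 -- ls -l /var/lib/rabbitmq/.erlang.cookie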

Below are the relevant deployment and configuration files:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq
  namespace: rabbit
spec:
  serviceName: rabbitmq-internal
  revisionHistoryLimit: 3
  updateStrategy:
    type: RollingUpdate
  replicas: 3
  selector: 
    matchLabels:
      app: rabbitmq
  template:
    metadata:
      name: rabbitmq
      labels:
        app: rabbitmq
    spec:
      serviceAccountName: rabbitmq
      terminationGracePeriodSeconds: 10
      containers:        
      - name: rabbitmq
        image: rabbitmq:0.13
        lifecycle:
          postStart:
            exec:
              command:
                - /bin/sh
                - -c
                - >
                  until rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} node_health_check; do sleep 1; done;
                  rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} set_policy ha-all "" '{"ha-mode":"all", "ha-sync-mode": "automatic"}'
        ports:
        - containerPort: 4369
        - containerPort: 5672
        - containerPort: 5671
        - containerPort: 25672
        - containerPort: 15672
        resources:
          requests:
            memory: "500Mi"
            cpu: "0.4"
          limits:
            memory: "600Mi"
            cpu: "0.6"
        livenessProbe:
          exec:
            # Stage 2 check:
            command: ["rabbitmq-diagnostics", "status", "--erlang-cookie", "$(RABBITMQ_ERLANG_COOKIE)"]
          initialDelaySeconds: 60
          periodSeconds: 60
          timeoutSeconds: 15
        readinessProbe:
          exec:
            # Stage 2 check:
            command: ["rabbitmq-diagnostics", "status", "--erlang-cookie", "$(RABBITMQ_ERLANG_COOKIE)"]
          initialDelaySeconds: 20
          periodSeconds: 60
          timeoutSeconds: 10
        envFrom:
         - configMapRef:
             name: rabbitmq-cfg
        env:
          - name: HOSTNAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
          - name: RABBITMQ_USE_LONGNAME
            value: "true"
          - name: RABBITMQ_NODENAME
            value: "rabbit@$(HOSTNAME).rabbitmq-internal.$(NAMESPACE).svc.cluster.local"
          - name: K8S_SERVICE_NAME
            value: "rabbitmq-internal"
          - name: RABBITMQ_DEFAULT_USER
            value: user
          - name: RABBITMQ_DEFAULT_PASS
            value: pass
          - name: RABBITMQ_ERLANG_COOKIE
            value: my-cookie
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
        volumeMounts:
        - name: my-volume-mount
          mountPath: "/var/lib/rabbitmq/mnesia"
      imagePullSecrets:
      - name: my-secret
      volumes:
        - name: my-volume-mount
          azureFile:
            secretName: azure-rabbitmq-secret
            shareName: my-fileshare-name
            readOnly: false
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rabbitmq-cfg
  namespace: rabbit
data:
  RABBITMQ_VM_MEMORY_HIGH_WATERMARK: "0.6"
---
kind: Service
apiVersion: v1
metadata:
  namespace: rabbit
  name: rabbitmq-internal
  labels:
    app: rabbitmq
spec:
  clusterIP: None
  ports:
    - name: http
      protocol: TCP
      port: 15672
    - name: amqp
      protocol: TCP
      port: 5672
    - name: amqps
      protocol: TCP
      port: 5671
  selector:
    app: rabbitmq  
---
kind: Service
apiVersion: v1
metadata:
  namespace: rabbit
  name: rabbitmq
  labels:
    app: rabbitmq
spec:
  type: LoadBalancer
  selector:
    app: rabbitmq
  ports:
   - name: http
     protocol: TCP
     port: 15672
     targetPort: 15672
   - name: amqp
     protocol: TCP
     port: 5672
     targetPort: 5672
   - name: amqps
     protocol: TCP
     port: 5671
     targetPort: 5671
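
For comparison only (not what I am currently running): the StatefulSet above mounts a single shared Azure file share into every replica, whereas a StatefulSet can also request one PersistentVolumeClaim per replica through volumeClaimTemplates. A minimal sketch of that alternative, with a hypothetical storage class name, would sit under the StatefulSet spec like this:

  # sketch only: per-replica PVCs instead of a single shared azureFile volume
  volumeClaimTemplates:
  - metadata:
      name: my-volume-mount
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: managed-premium   # hypothetical storage class name
      resources:
        requests:
          storage: 1Gi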

Dockerfile:

FROM rabbitmq:3.8.4
COPY conf/rabbitmq.conf /etc/rabbitmq
COPY conf/enabled_plugins /etc/rabbitmq

USER root
COPY conf/.erlang.cookie /var/lib/rabbitmq
RUN /bin/bash -c 'ls -ld /var/lib/rabbitmq/.erlang.cookie; chmod 600 /var/lib/rabbitmq/.erlang.cookie; ls -ld /var/lib/rabbitmq/.erlang.cookie'
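
The enabled_plugins file copied in the Dockerfile is not reproduced here; given the peer-discovery backend and the management port configured in rabbitmq.conf below, it presumably contains something like the following (hypothetical contents):

% hypothetical enabled_plugins contents; the real file is not shown above
[rabbitmq_management,rabbitmq_peer_discovery_k8s].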

rabbitmq.conf:

## cluster formation settings
cluster_formation.peer_discovery_backend  = rabbit_peer_discovery_k8s
cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
cluster_formation.k8s.address_type = hostname
cluster_formation.k8s.service_name = rabbitmq-internal
cluster_formation.k8s.hostname_suffix = .rabbitmq-internal.rabbit.svc.cluster.local
cluster_formation.node_cleanup.interval = 60
cluster_formation.node_cleanup.only_log_warning = true
cluster_partition_handling = autoheal
queue_master_locator = min-masters

## general settings
log.file.level = debug

## Mgmt UI secure/non-secure connection settings (secure not implemented yet)
management.tcp.port       = 15672

## RabbitMQ entrypoint settings (will be injected below when image is built)

Thanks!

0 Answers