PostgreSQL / PostDock: automatic recovery fails on the master node

Asked: 2018-11-16 10:53:27

Tags: postgresql pgpool automatic-failover repmgr

I am using Docker services on Docker Swarm to deploy a PostDock cluster.

Here is my docker-compose.yml:

version: "3.3"
networks:
  postdock:
    external: true

services:      
  pgmaster:
    image: postdock/postgres
    environment:
      PARTNER_NODES: "pgmaster,pgslave1"
      CLUSTER_NODE_NETWORK_NAME: pgmaster
      NODE_PRIORITY: 100  
      NODE_ID: 1
      NODE_NAME: pgmaster
      POSTGRES_PASSWORD: 123
      POSTGRES_USER: postgres
      POSTGRES_DB: postgres
      CONFIGS: "listen_addresses:'*'"
      CLUSTER_NAME: pg_cluster 
      REPLICATION_DB: replication_db
      REPLICATION_USER: replication_user 
      REPLICATION_PASSWORD: replication_pass 
    ports:
      - 4000:5432
    volumes:
      - /data/master_slave:/var/lib/postgresql/data
    networks:
      - postdock
    deploy:
      placement:
        constraints:
          - node.role == manager
          - node.hostname == 192.168.1.161

  pgslave1:
    image: postdock/postgres
    environment:
      PARTNER_NODES: "pgmaster,pgslave1"
      REPLICATION_PRIMARY_HOST: pgmaster
      NODE_ID: 2
      NODE_NAME: pgslave1
      CLUSTER_NODE_NETWORK_NAME: pgslave1 
      REPLICATION_PRIMARY_PORT: 5432
      CONFIGS: "max_replication_slots:10"
    ports:
      - 4001:5432
    volumes:
      - /data/slave_1:/var/lib/postgresql/data
    networks:
      - postdock
    deploy:
      placement:
        constraints:
          - node.role == manager
          - node.hostname == 192.168.1.161

  pgslave2:
    image: postdock/postgres
    environment:
      PARTNER_NODES: "pgmaster,pgslave1,pgslave2"
      REPLICATION_PRIMARY_HOST: pgmaster
      NODE_ID: 3
      NODE_NAME: pgslave2
      CLUSTER_NODE_NETWORK_NAME: pgslave2 
      REPLICATION_PRIMARY_PORT: 5432
      CONFIGS: "max_replication_slots:10"
    ports:
      - 4002:5432
    volumes:
      - /data/slave_2:/var/lib/postgresql/data
    networks:
      - postdock
    deploy:
      placement:
        constraints:
          - node.role == manager
          - node.hostname == 192.168.1.161

  db:
    image: postdock/pgpool
    environment:
      PCP_USER: pcp_user
      PCP_PASSWORD: pcp_pass
      WAIT_BACKEND_TIMEOUT: 60
      CHECK_USER: postgres
      CHECK_PASSWORD: 123
      CHECK_PGCONNECT_TIMEOUT: 3 
      DB_USERS: postgres:123
      BACKENDS: "0:pgmaster:5432:1:/var/lib/postgresql/data:ALLOW_TO_FAILOVER,1:pgslave1::::,2:pgslave2::::," 
      REQUIRE_MIN_BACKENDS: 1 
      CONFIGS: "num_init_children:250,max_pool:4"
    ports:
      - 4003:5432
      - 9899:9898
    networks:
      - postdock
    deploy:
      placement:
        constraints:
          - node.role == manager
          - node.hostname == 192.168.1.161

I run:

docker network create -d overlay postdock

docker stack deploy -c docker-compose.yml postdock

Everything works fine at first.

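However, after I updated the services several times, automatic recovery on the master node failed. In the master node's log I noticed that the recovery process cannot detect the database replication_db or the schema replication_db.public: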

>>> Waiting for local postgres server start...,
expr: non-integer argument,
>>> Wait schema . on pgmaster:5432(user: public,password: *******), will try  times with delay 10 seconds (TIMEOUT=)

As you can see, no schema is specified, only a dot ("."), and the user is wrong as well: it should be replication_user, not public.

This leads to the following error messages:

2018-11-16 04:45:33.310 UTC [122] FATAL:  password authentication failed for user "public",
2018-11-16 04:45:33.310 UTC [122] DETAIL:  Role "public" does not exist.,
    Connection matched pg_hba.conf line 95: "host all all all md5",
psql: FATAL:  password authentication failed for user "public",
2018-11-16 04:45:37.974 UTC [125] FATAL:  no PostgreSQL user name specified in startup packet,
2018-11-16 04:45:39.345 UTC [127] FATAL:  no PostgreSQL user name specified in startup packet,
2018-11-16 04:45:40.374 UTC [128] FATAL:  no PostgreSQL user name specified in startup packet,
2018-11-16 04:45:41.386 UTC [129] FATAL:  no PostgreSQL user name specified in startup packet,
2018-11-16 04:45:42.421 UTC [130] FATAL:  no PostgreSQL user name specified in startup packet,
>>>>>> Host pgmaster:5432 is not accessible (will try  times more),
expr: non-integer argument,
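
One possible explanation, not confirmed against PostDock's scripts, is that the replication settings were empty when the wait script ran and were passed as unquoted positional arguments. The sketch below reproduces the same log line under that assumption. The function wait_schema and its argument order (host, port, user, password, db, schema, timeout) are hypothetical stand-ins, not PostDock's actual code.

```shell
#!/bin/sh
# Hypothetical stand-in for a wait script. This is NOT PostDock's code;
# the argument order (host port user password db schema timeout) is assumed.
wait_schema() {
    host=$1; port=$2; user=$3; password=$4; db=$5; schema=$6; timeout=$7
    delay=10
    # With an empty timeout, expr reports an error on stderr and prints nothing,
    # so the number of tries ends up empty as well.
    tries=$(expr "$timeout" / "$delay") || true
    echo ">>> Wait schema $db.$schema on $host:$port(user: $user,password: *******), will try $tries times with delay $delay seconds (TIMEOUT=$timeout)"
}

# Simulate a container in which the replication settings are empty.
REPLICATION_DB=""; REPLICATION_USER=""; REPLICATION_PASSWORD=""

# Unquoted: the empty variables vanish during word splitting, so "public"
# slides into the user slot and 90 into the password slot.
wait_schema pgmaster 5432 $REPLICATION_USER $REPLICATION_PASSWORD $REPLICATION_DB public 90

# Quoted: empty values keep their positions, so the empty user is visible.
wait_schema pgmaster 5432 "$REPLICATION_USER" "$REPLICATION_PASSWORD" "$REPLICATION_DB" public 90
```

If this is what happens, the first call prints the same "Wait schema . on pgmaster:5432(user: public" line as the log above, which would point at missing environment variables rather than at PostgreSQL itself. Whether the variables are present in the running container could be checked with something like `docker exec <container> env | grep REPLICATION`, where `<container>` is the ID of the pgmaster task's container.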

As far as I understand, when automatic recovery succeeds, the recovery log should look like this:

>>> Waiting for local postgres server start...,
>>> Wait schema replication_db.public on pgmaster:5432(user: replication_user,password: *******), will try 9 times with delay 10 seconds (TIMEOUT=90),
>>>>>> Schema replication_db.public exists on host pgmaster:5432!,
>>> Registering node with role master

Does anyone know the root cause of this problem?

0 answers:

No answers yet.