We are having a problem configuring PostgreSQL for HA with Corosync and Pacemaker.
crm_mon output:
Last updated: Thu Dec 18 10:24:04 2014
Last change: Thu Dec 18 10:16:30 2014 via crmd on umhtvappdpj05.arqiva.local
Stack: corosync
Current DC: umhtvappdpj06.arqiva.local (1) - partition with quorum
Version: 1.1.10-29.el7-368c726
2 Nodes configured
4 Resources configured
Online: [ umhtvappdpj05.arqiva.local umhtvappdpj06.arqiva.local ]
Full list of resources:
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ umhtvappdpj06.arqiva.local ]
     Slaves: [ umhtvappdpj05.arqiva.local ]
 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2): Started umhtvappdpj06.arqiva.local
     vip-rep (ocf::heartbeat:IPaddr2): Started umhtvappdpj06.arqiva.local
Node Attributes:
* Node umhtvappdpj05.arqiva.local:
    + master-pgsql : -INFINITY
    + pgsql-data-status : LATEST
    + pgsql-status : HS:alone
    + pgsql-xlog-loc : 0000000097000168
* Node umhtvappdpj06.arqiva.local:
    + master-pgsql : 1000
    + pgsql-data-status : LATEST
    + pgsql-master-baseline : 0000000094000090
    + pgsql-status : PRI
Migration summary:
* Node umhtvappdpj05.arqiva.local:
* Node umhtvappdpj06.arqiva.local:
Here node 06 (umhtvappdpj06.arqiva.local) has started as the master and node 05 (umhtvappdpj05.arqiva.local) as the standby, but the two are not connected to each other.
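The `pgsql-status : HS:alone` attribute in the output above is what indicates that the standby is running without a replication connection (a connected standby would normally report `HS:sync` or `HS:async`). As a rough sketch, the attribute can be pulled out per node like this; the function name `pgsql_status` is hypothetical and not part of any tool used here.

```shell
#!/usr/bin/env bash
# Reads crm_mon output on stdin and prints "<node> <pgsql-status>" per node.
# pgsql_status is a hypothetical helper name chosen for this sketch.
pgsql_status() {
    awk '
        # Remember the node name from lines like "* Node <name>:"
        /^[[:space:]]*\* Node / { node = $3; sub(/:$/, "", node) }
        # The attribute value is the last field on the pgsql-status line
        /pgsql-status/          { print node, $NF }
    '
}

# Usage on a cluster node (one-shot output including node attributes):
#   crm_mon -A1 | pgsql_status
```

Fed the output shown above, this would list node 05 as `HS:alone` and node 06 as `PRI`.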
recovery.conf on node 05
standby_mode = 'on'
primary_conninfo = 'host=10.52.6.95 port=5432 user=postgres application_name=umhtvappdpj05.arqiva.local keepalives_idle=60 keepalives_interval=5 keepalives_count=5'
restore_command = 'scp 10.52.6.85:/var/lib/pgsql/pg_archive/%f %p'
recovery_target_timeline = 'latest'
The resources were created as follows:
pcs resource create vip-master IPaddr2 \
    ip="10.52.6.94" \
    nic="ens192" \
    cidr_netmask="24" \
    op start timeout="60s" interval="0s" on-fail="restart" \
    op monitor timeout="60s" interval="10s" on-fail="restart" \
    op stop timeout="60s" interval="0s" on-fail="block"

pcs resource create vip-rep IPaddr2 \
    ip="10.52.6.95" \
    nic="ens192" \
    cidr_netmask="24" \
    meta migration-threshold="0" \
    op start timeout="60s" interval="0s" on-fail="stop" \
    op monitor timeout="60s" interval="10s" on-fail="restart" \
    op stop timeout="60s" interval="0s" on-fail="ignore"

pcs resource create pgsql ocf:heartbeat:pgsql \
    pgctl="/usr/pgsql-9.3/bin/pg_ctl" \
    psql="/usr/pgsql-9.3/bin/psql" \
    pgdata="/pgdata/data" \
    rep_mode="sync" \
    node_list="10.52.6.85 10.52.6.92" \
    restore_command="scp 10.52.6.85:/var/lib/pgsql/pg_archive/%f %p" \
    master_ip="10.52.6.95" \
    primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \
    restart_on_promote='true' \
    op start timeout="60s" interval="0s" on-fail="restart" \
    op monitor timeout="60s" interval="10s" on-fail="restart" \
    op monitor timeout="60s" interval="9s" on-fail="restart" role="Master" \
    op promote timeout="60s" interval="0s" on-fail="restart" \
    op demote timeout="60s" interval="0s" on-fail="stop" \
    op stop timeout="60s" interval="0s" on-fail="block" \
    op notify timeout="60s" interval="0s"

[root@umhtvappdpj05 data]# pcs resource show --all
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ umhtvappdpj06.arqiva.local ]
     Slaves: [ umhtvappdpj05.arqiva.local ]
 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2): Started
     vip-rep (ocf::heartbeat:IPaddr2): Started
[root@umhtvappdpj05 data]#
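The only anomaly is that Corosync and Pacemaker were first installed on node 6 while it was on a different subnet from node 5. Node 6 was later moved to the same subnet as node 5. Could this be the cause? Perhaps reinstalling on node 6 would make sense.
Thanks,
Samir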
Answer 0 (score: 0)
This is how to connect the replica to the primary PostgreSQL server.
1. Touch the PGSQL.lock file in /var/lib/pgsql/tmp/ on umhtvappdpj05.arqiva.local.
2. Stop PostgreSQL on node umhtvappdpj05.arqiva.local using systemctl.
3. Take a base backup (or rsync the data directory) from the primary server to the replica.
4. Remove the PGSQL.lock file from the replica.
5. Run pcs resource cleanup pgsql.
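As a rough sketch, the steps above could be scripted as follows, run as root on the replica. Several details are assumptions rather than part of the steps themselves: the systemd unit name `postgresql-9.3`, the choice of `pg_basebackup` (instead of rsync) for step 3, moving the old data directory aside to `/pgdata/data.old`, and the helper names `run` and `resync_replica`. The paths and the replication address are taken from the resource definition in the question. By default the script only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch of the five resync steps for the replica (run as root on node 05).
# Assumptions: unit name, pg_basebackup for step 3, and keeping the old data
# directory as "$PGDATA.old" are illustrative choices, not from the answer.

PGDATA="${PGDATA:-/pgdata/data}"                        # pgdata of the pgsql resource
LOCK_FILE="${LOCK_FILE:-/var/lib/pgsql/tmp/PGSQL.lock}"
PRIMARY_HOST="${PRIMARY_HOST:-10.52.6.95}"              # master_ip (vip-rep address)
PG_SERVICE="${PG_SERVICE:-postgresql-9.3}"              # assumed systemd unit name
PG_BIN="${PG_BIN:-/usr/pgsql-9.3/bin}"
DRY_RUN="${DRY_RUN:-1}"                                 # set to 0 to really execute

run() {
    # Always print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

resync_replica() {
    # Each step runs only if the previous one succeeded.
    run touch "$LOCK_FILE" &&                                    # step 1
    run systemctl stop "$PG_SERVICE" &&                          # step 2
    run mv "$PGDATA" "$PGDATA.old" &&                            # step 3: clear the target
    run install -d -o postgres -g postgres -m 700 "$PGDATA" &&
    run su postgres -c "$PG_BIN/pg_basebackup -h $PRIMARY_HOST -U postgres -D $PGDATA -X stream -P" &&
    run rm -f "$LOCK_FILE" &&                                    # step 4
    run pcs resource cleanup pgsql                               # step 5
}

resync_replica
```

After reviewing the printed commands, running the script with `DRY_RUN=0` would execute them for real.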
These steps have always worked for me. Let me know whether they work for you.