I have a three-node PostgreSQL cluster with one primary (192.168.50.3) and two standbys (192.168.50.4 and 192.168.50.5). On the standby nodes I run the following command to take a base backup:
pg_basebackup -h 192.168.50.3 -D <postgres_data_dir> -U <replication_user_name> --wal-method=stream -d 'sslmode=require sslcompression=0'
Once the command above returns 0 (success), I create a recovery.conf file. On the standby nodes, recovery.conf looks like this:
standby_mode = 'on'
primary_conninfo = 'host=192.168.50.3 port=5432 user=myuser password=<password_here> sslmode=require sslcompression=0'
trigger_file = '/tmp/make_master'
recovery_target_timeline = 'latest'
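For reference, the same four settings can be generated by a small shell function, which makes it easy to rewrite the file later with a different primary host. This is a minimal sketch under stated assumptions: `write_recovery_conf` is a hypothetical helper (not part of PostgreSQL), it targets PostgreSQL 11 (PostgreSQL 12 replaced recovery.conf with standby.signal plus settings in postgresql.conf), and the password is assumed to come from the postgres user's ~/.pgpass file rather than from primary_conninfo. `pg_basebackup -R` can also write a recovery.conf, but only a minimal one (standby_mode and primary_conninfo).

```shell
#!/usr/bin/env bash
# Sketch only: write_recovery_conf is a hypothetical helper, not a PostgreSQL tool.
# Assumes PostgreSQL 11, where a standby is configured through recovery.conf,
# and that the replication password comes from ~/.pgpass instead of primary_conninfo.
set -euo pipefail

# Usage: write_recovery_conf <data_dir> <primary_host> <replication_user>
write_recovery_conf() {
  local data_dir=$1
  local primary_host=$2
  local repl_user=$3
  # Overwrites any existing recovery.conf in the data directory.
  cat > "${data_dir}/recovery.conf" <<EOF
standby_mode = 'on'
primary_conninfo = 'host=${primary_host} port=5432 user=${repl_user} sslmode=require sslcompression=0'
trigger_file = '/tmp/make_master'
recovery_target_timeline = 'latest'
EOF
}
```

For example, `write_recovery_conf /var/lib/pgsql/11/data 192.168.50.4 myuser`, run as the postgres OS user, would produce the settings for re-pointing 192.168.50.5 at the new primary, minus the password.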
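With this setup, replication works fine when I start PostgreSQL on the standbys. Now for the failover test: I shut down the primary (192.168.50.3), promote the standby 192.168.50.4, and then try to point the other standby (192.168.50.5) at the new primary (192.168.50.4). To do that, I do the following: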
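Run pg_rewind on 192.168.50.5:
/usr/pgsql-11/bin/pg_rewind -D <data_dir_path> --source-server="port=5432 user=<username> host=192.168.50.4"
Then update recovery.conf on 192.168.50.5 so that it points at the new primary: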
standby_mode = 'on'
primary_conninfo = 'host=192.168.50.4 port=5432 user=myuser password=<password_here> sslmode=require sslcompression=0'
trigger_file = '/tmp/make_master'
recovery_target_timeline = 'latest'
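The steps above, in the order pg_rewind expects, can be sketched as a script for 192.168.50.5. It assumes PostgreSQL 11 with the PGDG paths used in this question; `run`, `repoint_standby` and the `DRY_RUN` switch are hypothetical helpers, and `DRY_RUN=1` only prints the commands. Two points it encodes: pg_rewind requires the target to have been shut down cleanly, and pg_rewind in PostgreSQL 11 does not write recovery.conf, so the file is recreated after the rewind.

```shell
#!/usr/bin/env bash
# Sketch: re-point an old standby at a newly promoted primary with pg_rewind (PostgreSQL 11).
# PGBIN, PGDATA, NEW_PRIMARY, REPL_USER, run(), repoint_standby() and DRY_RUN are
# assumptions for illustration; run it as the postgres OS user.
set -euo pipefail

PGBIN=/usr/pgsql-11/bin
PGDATA=/var/lib/pgsql/11/data
NEW_PRIMARY=192.168.50.4
REPL_USER=myuser

# Execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

repoint_standby() {
  # 1. pg_rewind refuses to run unless the target was shut down cleanly.
  run "$PGBIN/pg_ctl" -D "$PGDATA" -m fast -w stop
  # 2. Rewind this data directory against the running new primary.
  run "$PGBIN/pg_rewind" -D "$PGDATA" \
    --source-server="host=$NEW_PRIMARY port=5432 user=$REPL_USER dbname=postgres"
  # 3. pg_rewind in PostgreSQL 11 does not write recovery.conf, so recreate it.
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "write $PGDATA/recovery.conf (host=$NEW_PRIMARY)"
  else
    printf '%s\n' \
      "standby_mode = 'on'" \
      "primary_conninfo = 'host=$NEW_PRIMARY port=5432 user=$REPL_USER sslmode=require sslcompression=0'" \
      "trigger_file = '/tmp/make_master'" \
      "recovery_target_timeline = 'latest'" > "$PGDATA/recovery.conf"
  fi
  # 4. Start as a standby; recovery_target_timeline = 'latest' follows the newest timeline.
  run "$PGBIN/pg_ctl" -D "$PGDATA" -w start
}
```

Calling `DRY_RUN=1 repoint_standby` prints the four steps without touching the cluster. If the service is managed by systemd, the pg_ctl stop/start calls would normally be replaced by the corresponding systemctl commands.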
After starting PostgreSQL on 192.168.50.5, the log either shows
LOG: invalid resource manager ID <some_id_here>
or keeps saying
postgres: startup recovering 000000060000000000
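I can't figure out what is going wrong here.
Do I need to make sure the standby (192.168.50.5) is not replicating before joining it to the new primary (192.168.50.4)?
Should I first promote the standby (192.168.50.5) and then join it to the cluster with the new primary (192.168.50.4)? Or should I always take a fresh base backup from 192.168.50.4 instead of using pg_rewind?
Are there any other standard practices I should follow?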
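If pg_rewind is replaced by re-cloning 192.168.50.5 from the new primary every time, one possible sequence is: stop the node, move the old data directory aside, take a fresh base backup from 192.168.50.4 and start it as a standby. A minimal sketch, assuming PostgreSQL 11 and the paths and user from this question; `run`, `rebuild_standby` and `DRY_RUN` are hypothetical helpers. pg_basebackup's `-R` option writes a minimal recovery.conf (standby_mode and primary_conninfo), so the remaining two settings are appended afterwards.

```shell
#!/usr/bin/env bash
# Sketch: rebuild a standby from a fresh base backup of the new primary (PostgreSQL 11).
# PGBIN, PGDATA, NEW_PRIMARY, REPL_USER, run(), rebuild_standby() and DRY_RUN are
# assumptions for illustration; run it as the postgres OS user.
set -euo pipefail

PGBIN=/usr/pgsql-11/bin
PGDATA=/var/lib/pgsql/11/data
NEW_PRIMARY=192.168.50.4
REPL_USER=myuser

# Execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

rebuild_standby() {
  local stamp
  stamp=$(date +%Y%m%d%H%M%S)
  run "$PGBIN/pg_ctl" -D "$PGDATA" -m fast -w stop
  # Keep the old data directory until the new standby is known to be healthy.
  run mv "$PGDATA" "$PGDATA.old.$stamp"
  # -R writes a minimal recovery.conf (standby_mode and primary_conninfo).
  run "$PGBIN/pg_basebackup" -h "$NEW_PRIMARY" -U "$REPL_USER" -D "$PGDATA" \
    --wal-method=stream -R -d 'sslmode=require sslcompression=0'
  # Append the settings that -R does not write.
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "append trigger_file and recovery_target_timeline to $PGDATA/recovery.conf"
  else
    printf '%s\n' "trigger_file = '/tmp/make_master'" \
      "recovery_target_timeline = 'latest'" >> "$PGDATA/recovery.conf"
  fi
  run "$PGBIN/pg_ctl" -D "$PGDATA" -w start
}
```

The trade-off is mainly cost: a base backup copies the whole cluster over the network, whereas pg_rewind mostly copies the blocks that changed after the point of divergence.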
Logs from the standby (192.168.50.5).
What I did:
1. Promoted 192.168.50.5, then ran pg_rewind
May 20 09:24:24 myhost postgres[23471]: [11-1] 2019-05-20 09:24:24 UTC LOG: received promote request
May 20 09:24:24 myhost postgres[23471]: [12-1] 2019-05-20 09:24:24 UTC LOG: redo done at 0/8065B60
May 20 09:24:24 myhost postgres[23471]: [13-1] 2019-05-20 09:24:24 UTC LOG: selected new timeline ID: 2
May 20 09:24:25 myhost postgres[23471]: [14-1] 2019-05-20 09:24:25 UTC LOG: archive recovery complete
May 20 09:24:25 myhost postgres[23463]: [7-1] 2019-05-20 09:24:25 UTC LOG: database system is ready to accept connections
May 20 09:25:35 myhost postgres[23463]: [8-1] 2019-05-20 09:25:35 UTC LOG: received fast shutdown request
May 20 09:25:35 myhost postgres[23463]: [9-1] 2019-05-20 09:25:35 UTC LOG: aborting any active transactions
May 20 09:25:35 myhost postgres[23650]: [8-1] 2019-05-20 09:25:35 UTC FATAL: terminating connection due to administrator command
May 20 09:25:35 myhost postgres[23463]: [10-1] 2019-05-20 09:25:35 UTC LOG: background worker "logical replication launcher" (PID 23635) exited with exit code 1
May 20 09:25:35 myhost postgres[23472]: [6-1] 2019-05-20 09:25:35 UTC LOG: shutting down
May 20 09:25:35 myhost postgres[23463]: [11-1] 2019-05-20 09:25:35 UTC LOG: database system is shut down
May 20 09:25:51 myhost postgres[25121]: [1-1] 2019-05-20 09:25:51 UTC LOG: listening on IPv4 address "0.0.0.0", port 5432
May 20 09:25:51 myhost postgres[25121]: [2-1] 2019-05-20 09:25:51 UTC LOG: could not create IPv6 socket for address "::": Address family not supported by protocol
May 20 09:25:51 myhost postgres[25121]: [3-1] 2019-05-20 09:25:51 UTC LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
May 20 09:25:51 myhost postgres[25121]: [4-1] 2019-05-20 09:25:51 UTC LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
May 20 09:25:51 myhost postgres[25121]: [5-1] 2019-05-20 09:25:51 UTC LOG: ending log output to stderr
May 20 09:25:51 myhost postgres[25121]: [5-2] 2019-05-20 09:25:51 UTC HINT: Future log output will go to log destination "syslog".
May 20 09:25:51 myhost postgres[25129]: [6-1] 2019-05-20 09:25:51 UTC LOG: database system was shut down at 2019-05-20 09:25:35 UTC
May 20 09:25:51 myhost postgres[25121]: [6-1] 2019-05-20 09:25:51 UTC LOG: database system is ready to accept connections
May 20 09:25:58 myhost postgres[25373]: [7-1] 2019-05-20 09:25:58 UTC LOG: could not receive data from client: Connection reset by peer
May 20 09:26:07 myhost postgres[25121]: [7-1] 2019-05-20 09:26:07 UTC LOG: received fast shutdown request
May 20 09:26:07 myhost postgres[25121]: [8-1] 2019-05-20 09:26:07 UTC LOG: aborting any active transactions
May 20 09:26:07 myhost postgres[25496]: [7-1] 2019-05-20 09:26:07 UTC FATAL: terminating connection due to administrator command
May 20 09:26:07 myhost postgres[25303]: [7-1] 2019-05-20 09:26:07 UTC FATAL: terminating connection due to administrator command
May 20 09:26:07 myhost postgres[25478]: [7-1] 2019-05-20 09:26:07 UTC FATAL: terminating connection due to administrator command
May 20 09:26:07 myhost postgres[25121]: [9-1] 2019-05-20 09:26:07 UTC LOG: background worker "logical replication launcher" (PID 25138) exited with exit code 1
May 20 09:26:07 myhost postgres[25133]: [6-1] 2019-05-20 09:26:07 UTC LOG: shutting down
May 20 09:26:07 myhost postgres[25121]: [10-1] 2019-05-20 09:26:07 UTC LOG: database system is shut down
May 20 09:26:17 myhost postgres[25661]: [1-1] 2019-05-20 09:26:17 UTC LOG: listening on IPv4 address "0.0.0.0", port 5432
May 20 09:26:17 myhost postgres[25661]: [2-1] 2019-05-20 09:26:17 UTC LOG: could not create IPv6 socket for address "::": Address family not supported by protocol
May 20 09:26:17 myhost postgres[25661]: [3-1] 2019-05-20 09:26:17 UTC LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
May 20 09:26:17 myhost postgres[25661]: [4-1] 2019-05-20 09:26:17 UTC LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
May 20 09:26:17 myhost postgres[25661]: [5-1] 2019-05-20 09:26:17 UTC LOG: ending log output to stderr
May 20 09:26:17 myhost postgres[25661]: [5-2] 2019-05-20 09:26:17 UTC HINT: Future log output will go to log destination "syslog".
May 20 09:26:17 myhost postgres[25670]: [6-1] 2019-05-20 09:26:17 UTC LOG: database system was shut down at 2019-05-20 09:26:07 UTC
May 20 09:26:17 myhost postgres[25670]: [7-1] 2019-05-20 09:26:17 UTC LOG: entering standby mode
May 20 09:26:17 myhost postgres[25670]: [8-1] 2019-05-20 09:26:17 UTC LOG: consistent recovery state reached at 0/806CF98
May 20 09:26:17 myhost postgres[25670]: [9-1] 2019-05-20 09:26:17 UTC LOG: invalid record length at 0/806CF98: wanted 24, got 0
May 20 09:26:17 myhost postgres[25661]: [6-1] 2019-05-20 09:26:17 UTC LOG: database system is ready to accept read only connections
May 20 09:26:17 myhost postgres[25674]: [7-1] 2019-05-20 09:26:17 UTC LOG: started streaming WAL from primary at 0/8000000 on timeline 2
May 20 09:26:17 myhost postgres[25670]: [10-1] 2019-05-20 09:26:17 UTC LOG: invalid resource manager ID 45 at 0/806CF98
May 20 09:26:17 myhost postgres[25674]: [8-1] 2019-05-20 09:26:17 UTC FATAL: terminating walreceiver process due to administrator command
May 20 09:26:17 myhost postgres[25670]: [11-1] 2019-05-20 09:26:17 UTC LOG: invalid resource manager ID 45 at 0/806CF98
May 20 09:26:17 myhost postgres[25670]: [12-1] 2019-05-20 09:26:17 UTC LOG: invalid resource manager ID 45 at 0/806CF98
May 20 09:26:22 myhost postgres[25670]: [13-1] 2019-05-20 09:26:22 UTC LOG: invalid resource manager ID 45 at 0/806CF98
May 20 09:26:27 myhost postgres[25670]: [14-1] 2019-05-20 09:26:27 UTC LOG: invalid resource manager ID 45 at 0/806CF98
May 20 09:26:32 myhost postgres[25670]: [15-1] 2019-05-20 09:26:32 UTC LOG: invalid resource manager ID 45 at 0/806CF98
May 20 09:26:37 myhost postgres[25670]: [16-1] 2019-05-20 09:26:37 UTC LOG: invalid resource manager ID 45 at 0/806CF98
May 20 09:26:42 myhost postgres[25670]: [17-1] 2019-05-20 09:26:42 UTC LOG: invalid resource manager ID 45 at 0/806CF98
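In the log above, 192.168.50.5 selects timeline 2 when it is promoted and later streams WAL "on timeline 2" from the primary. Whether both nodes mean the same timeline 2 can be checked by comparing pg_wal/00000002.history on each node; each line of a timeline history file holds the parent timeline ID, the switch-point LSN and a reason, separated by tabs. A small sketch with a hypothetical helper that extracts the first two fields:

```shell
#!/usr/bin/env bash
# Sketch: list the switch points recorded in a PostgreSQL timeline history file.
# history_switchpoints is a hypothetical helper; it reads the file on stdin.
set -euo pipefail

# Each data line is "<parent TLI><TAB><switch-point LSN><TAB><reason>".
# Blank lines and '#' comment lines are skipped; print "<parent TLI> <switch-point LSN>".
history_switchpoints() {
  awk -F'\t' 'NF >= 2 && $1 !~ /^#/ { print $1, $2 }'
}
```

For example, `history_switchpoints < /var/lib/pgsql/11/data/pg_wal/00000002.history` on each node; differing switch points would mean the two nodes do not share the same timeline 2 history.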
pg_rewind fails on 50.5 when joining it to the new primary 50.4:
[root@myhost user]# su - postgres -c "/usr/pgsql-11/bin/pg_rewind -D /var/lib/pgsql/11/data --source-server=\"port=5432 user=myuser host=192.168.50.4 dbname='db_name'\" --dry-run --debug"
fetched file "global/pg_control", length 8192
fetched file "pg_wal/00000002.history", length 41
Source timeline history:
Target timeline history:
1: 0/0 - 0/0
servers diverged at WAL location 0/8030178 on timeline 1
could not find previous WAL record at 0/8030178: invalid record length at 0/8030178: wanted 24, got 0
Failure, exiting
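pg_rewind has prerequisites that can be checked offline on the target: it must have been shut down cleanly, and the cluster must run with wal_log_hints = on or have been initialized with data checksums (initdb -k); full_page_writes must also be on, which is the default. pg_controldata reports the first two along with the latest checkpoint's timeline. A sketch of hypothetical helpers (`controldata_field`, `rewind_check`) that pull those fields out of pg_controldata output, assuming LC_ALL=C so the labels are in English:

```shell
#!/usr/bin/env bash
# Sketch: summarise the pg_controldata fields relevant to pg_rewind.
# controldata_field and rewind_check are hypothetical helpers; both read
# pg_controldata output (produced with LC_ALL=C) on stdin.
set -euo pipefail

# Print the value of one "Label:   value" line; values may themselves contain colons.
controldata_field() {
  local label=$1
  awk -v key="$label" 'index($0, key ":") == 1 {
    v = substr($0, length(key) + 2)
    sub(/^[ \t]+/, "", v)
    print v
    exit
  }'
}

rewind_check() {
  local out state tli hints csum
  out=$(cat)
  state=$(controldata_field "Database cluster state" <<<"$out")
  tli=$(controldata_field "Latest checkpoint's TimeLineID" <<<"$out")
  hints=$(controldata_field "wal_log_hints setting" <<<"$out")
  csum=$(controldata_field "Data page checksum version" <<<"$out")
  echo "state=$state timeline=$tli wal_log_hints=$hints checksums=$csum"
  # pg_rewind needs wal_log_hints=on or data checksums (version > 0) on the target.
  if [ "$hints" != "on" ] && [ "${csum:-0}" = 0 ]; then
    echo "WARNING: neither wal_log_hints nor data checksums are enabled" >&2
    return 1
  fi
}
```

On 192.168.50.5 this would be run as `LC_ALL=C /usr/pgsql-11/bin/pg_controldata /var/lib/pgsql/11/data | rewind_check`, ideally while the server is stopped so that the reported state reflects the shutdown.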