Question

I have a three-node PostgreSQL cluster: one primary (192.168.50.3) and two standbys (192.168.50.4 and 192.168.50.5). I run the following command on each standby to take a base backup:

pg_basebackup -h 192.168.50.3 -D <postgres_data_dir> -U <replication_user_name> --wal-method=stream -d 'sslmode=require sslcompression=0'

Once the above command returns 0 (success), I create the recovery.conf file. On the standby nodes, recovery.conf looks like this:

standby_mode          = 'on'
primary_conninfo      = 'host=192.168.50.3 port=5432 user=myuser password=<password_here> sslmode=require sslcompression=0'
trigger_file = '/tmp/make_master'
recovery_target_timeline = 'latest'
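
Since the only value in this file that depends on the topology is the host in primary_conninfo, the file can be generated instead of edited by hand. A minimal sketch, assuming a POSIX shell; the function name make_recovery_conf is hypothetical, and the user, password placeholder and trigger file are simply the values shown above:

```shell
#!/bin/sh
# make_recovery_conf HOST
# Hypothetical helper: prints a recovery.conf (PostgreSQL 11 style) that
# points a standby at the primary running on HOST.
make_recovery_conf() {
    primary_host=$1
    # Unquoted here-document so that ${primary_host} is expanded;
    # everything else is emitted literally.
    cat <<EOF
standby_mode          = 'on'
primary_conninfo      = 'host=${primary_host} port=5432 user=myuser password=<password_here> sslmode=require sslcompression=0'
trigger_file = '/tmp/make_master'
recovery_target_timeline = 'latest'
EOF
}

# Example: print the file for a standby that follows 192.168.50.3.
make_recovery_conf 192.168.50.3
```

The output would be redirected into recovery.conf inside the standby's data directory before the service is started.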

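When I start the PostgreSQL service on the standby nodes, replication works fine. To fail over, I shut down the primary (192.168.50.3), promote one standby (192.168.50.4), and then try to point the other standby (192.168.50.5) at the new primary (192.168.50.4). To do that, I perform the following steps: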

Stop PostgreSQL on 192.168.50.5.
Run pg_rewind:

/usr/pgsql-11/bin/pg_rewind -D <data_dir_path> --source-server="port=5432 user=<username> host=192.168.50.4"

Create recovery.conf pointing at the new primary, as follows:

standby_mode          = 'on'
primary_conninfo      = 'host=192.168.50.4 port=5432 user=myuser password=<password_here> sslmode=require sslcompression=0'
trigger_file = '/tmp/make_master'
recovery_target_timeline = 'latest'

Start the PostgreSQL service. After it starts, the log shows the following:

LOG: invalid resource manager ID <some_id_here>

or the PostgreSQL log keeps saying:

 postgres: startup   recovering 000000060000000000

I can't figure out what is going wrong here.
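
One thing that can be read off the second message: the first eight hexadecimal digits of a WAL segment file name are the timeline ID. A small sketch for decoding them, assuming a POSIX shell; the function name wal_timeline is hypothetical, and only the leading eight digits are used, so it also works on the truncated name shown above:

```shell
#!/bin/sh
# wal_timeline NAME
# Hypothetical helper: prints the decimal timeline ID encoded in the
# first eight hex digits of a WAL segment file name.
wal_timeline() {
    # Keep only the first eight characters of the name.
    hex=$(printf '%s' "$1" | cut -c1-8)
    # Reject input containing anything other than hex digits.
    case "$hex" in
        *[!0-9A-Fa-f]*) echo "not a WAL segment name: $1" >&2; return 1 ;;
    esac
    # Reject input shorter than eight characters.
    [ "${#hex}" -eq 8 ] || { echo "name too short: $1" >&2; return 1; }
    # Convert the hex digits to decimal.
    printf '%d\n' "0x$hex"
}

wal_timeline 000000060000000000   # prints 6
```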

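Do I need to make sure that replication on the standby (192.168.50.5) is not running before joining it to the new primary (192.168.50.4)?
Should I first promote the standby (192.168.50.5) and then join it to the cluster with the new primary (192.168.50.4)?
Should I always take a fresh base backup from 192.168.50.4 instead of using pg_rewind?
Is there any other standard practice I should follow?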

Logs from the standby (192.168.50.5). Here I did the following: promoted 192.168.50.5, then used pg_rewind to join it to the cluster under 192.168.50.4.

May 20 09:24:24 myhost postgres[23471]: [11-1] 2019-05-20 09:24:24 UTC LOG:  received promote request
May 20 09:24:24 myhost postgres[23471]: [12-1] 2019-05-20 09:24:24 UTC LOG:  redo done at 0/8065B60
May 20 09:24:24 myhost postgres[23471]: [13-1] 2019-05-20 09:24:24 UTC LOG:  selected new timeline ID: 2
May 20 09:24:25 myhost postgres[23471]: [14-1] 2019-05-20 09:24:25 UTC LOG:  archive recovery complete
May 20 09:24:25 myhost postgres[23463]: [7-1] 2019-05-20 09:24:25 UTC LOG:  database system is ready to accept connections
May 20 09:25:35 myhost postgres[23463]: [8-1] 2019-05-20 09:25:35 UTC LOG:  received fast shutdown request
May 20 09:25:35 myhost postgres[23463]: [9-1] 2019-05-20 09:25:35 UTC LOG:  aborting any active transactions
May 20 09:25:35 myhost postgres[23650]: [8-1] 2019-05-20 09:25:35 UTC FATAL:  terminating connection due to administrator command
May 20 09:25:35 myhost postgres[23463]: [10-1] 2019-05-20 09:25:35 UTC LOG:  background worker "logical replication launcher" (PID 23635) exited with exit code 1
May 20 09:25:35 myhost postgres[23472]: [6-1] 2019-05-20 09:25:35 UTC LOG:  shutting down
May 20 09:25:35 myhost postgres[23463]: [11-1] 2019-05-20 09:25:35 UTC LOG:  database system is shut down
May 20 09:25:51 myhost postgres[25121]: [1-1] 2019-05-20 09:25:51 UTC LOG:  listening on IPv4 address "0.0.0.0", port 5432
May 20 09:25:51 myhost postgres[25121]: [2-1] 2019-05-20 09:25:51 UTC LOG:  could not create IPv6 socket for address "::": Address family not supported by protocol
May 20 09:25:51 myhost postgres[25121]: [3-1] 2019-05-20 09:25:51 UTC LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
May 20 09:25:51 myhost postgres[25121]: [4-1] 2019-05-20 09:25:51 UTC LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
May 20 09:25:51 myhost postgres[25121]: [5-1] 2019-05-20 09:25:51 UTC LOG:  ending log output to stderr
May 20 09:25:51 myhost postgres[25121]: [5-2] 2019-05-20 09:25:51 UTC HINT:  Future log output will go to log destination "syslog".
May 20 09:25:51 myhost postgres[25129]: [6-1] 2019-05-20 09:25:51 UTC LOG:  database system was shut down at 2019-05-20 09:25:35 UTC
May 20 09:25:51 myhost postgres[25121]: [6-1] 2019-05-20 09:25:51 UTC LOG:  database system is ready to accept connections
May 20 09:25:58 myhost postgres[25373]: [7-1] 2019-05-20 09:25:58 UTC LOG:  could not receive data from client: Connection reset by peer
May 20 09:26:07 myhost postgres[25121]: [7-1] 2019-05-20 09:26:07 UTC LOG:  received fast shutdown request
May 20 09:26:07 myhost postgres[25121]: [8-1] 2019-05-20 09:26:07 UTC LOG:  aborting any active transactions
May 20 09:26:07 myhost postgres[25496]: [7-1] 2019-05-20 09:26:07 UTC FATAL:  terminating connection due to administrator command
May 20 09:26:07 myhost postgres[25303]: [7-1] 2019-05-20 09:26:07 UTC FATAL:  terminating connection due to administrator command
May 20 09:26:07 myhost postgres[25478]: [7-1] 2019-05-20 09:26:07 UTC FATAL:  terminating connection due to administrator command
May 20 09:26:07 myhost postgres[25121]: [9-1] 2019-05-20 09:26:07 UTC LOG:  background worker "logical replication launcher" (PID 25138) exited with exit code 1
May 20 09:26:07 myhost postgres[25133]: [6-1] 2019-05-20 09:26:07 UTC LOG:  shutting down
May 20 09:26:07 myhost postgres[25121]: [10-1] 2019-05-20 09:26:07 UTC LOG:  database system is shut down
May 20 09:26:17 myhost postgres[25661]: [1-1] 2019-05-20 09:26:17 UTC LOG:  listening on IPv4 address "0.0.0.0", port 5432
May 20 09:26:17 myhost postgres[25661]: [2-1] 2019-05-20 09:26:17 UTC LOG:  could not create IPv6 socket for address "::": Address family not supported by protocol
May 20 09:26:17 myhost postgres[25661]: [3-1] 2019-05-20 09:26:17 UTC LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
May 20 09:26:17 myhost postgres[25661]: [4-1] 2019-05-20 09:26:17 UTC LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
May 20 09:26:17 myhost postgres[25661]: [5-1] 2019-05-20 09:26:17 UTC LOG:  ending log output to stderr
May 20 09:26:17 myhost postgres[25661]: [5-2] 2019-05-20 09:26:17 UTC HINT:  Future log output will go to log destination "syslog".
May 20 09:26:17 myhost postgres[25670]: [6-1] 2019-05-20 09:26:17 UTC LOG:  database system was shut down at 2019-05-20 09:26:07 UTC
May 20 09:26:17 myhost postgres[25670]: [7-1] 2019-05-20 09:26:17 UTC LOG:  entering standby mode
May 20 09:26:17 myhost postgres[25670]: [8-1] 2019-05-20 09:26:17 UTC LOG:  consistent recovery state reached at 0/806CF98
May 20 09:26:17 myhost postgres[25670]: [9-1] 2019-05-20 09:26:17 UTC LOG:  invalid record length at 0/806CF98: wanted 24, got 0
May 20 09:26:17 myhost postgres[25661]: [6-1] 2019-05-20 09:26:17 UTC LOG:  database system is ready to accept read only connections
May 20 09:26:17 myhost postgres[25674]: [7-1] 2019-05-20 09:26:17 UTC LOG:  started streaming WAL from primary at 0/8000000 on timeline 2
May 20 09:26:17 myhost postgres[25670]: [10-1] 2019-05-20 09:26:17 UTC LOG:  invalid resource manager ID 45 at 0/806CF98
May 20 09:26:17 myhost postgres[25674]: [8-1] 2019-05-20 09:26:17 UTC FATAL:  terminating walreceiver process due to administrator command
May 20 09:26:17 myhost postgres[25670]: [11-1] 2019-05-20 09:26:17 UTC LOG:  invalid resource manager ID 45 at 0/806CF98
May 20 09:26:17 myhost postgres[25670]: [12-1] 2019-05-20 09:26:17 UTC LOG:  invalid resource manager ID 45 at 0/806CF98
May 20 09:26:22 myhost postgres[25670]: [13-1] 2019-05-20 09:26:22 UTC LOG:  invalid resource manager ID 45 at 0/806CF98
May 20 09:26:27 myhost postgres[25670]: [14-1] 2019-05-20 09:26:27 UTC LOG:  invalid resource manager ID 45 at 0/806CF98
May 20 09:26:32 myhost postgres[25670]: [15-1] 2019-05-20 09:26:32 UTC LOG:  invalid resource manager ID 45 at 0/806CF98
May 20 09:26:37 myhost postgres[25670]: [16-1] 2019-05-20 09:26:37 UTC LOG:  invalid resource manager ID 45 at 0/806CF98
May 20 09:26:42 myhost postgres[25670]: [17-1] 2019-05-20 09:26:42 UTC LOG:  invalid resource manager ID 45 at 0/806CF98

pg_rewind fails on 50.5 when joining it to the new primary 50.4:

[root@myhost user]# su - postgres  -c "/usr/pgsql-11/bin/pg_rewind -D /var/lib/pgsql/11/data --source-server=\"port=5432 user=myuser host=192.168.50.4 dbname='db_name'\" --dry-run --debug"
fetched file "global/pg_control", length 8192
fetched file "pg_wal/00000002.history", length 41
Source timeline history:
Target timeline history:
1: 0/0 - 0/0
servers diverged at WAL location 0/8030178 on timeline 1

could not find previous WAL record at 0/8030178: invalid record length at 0/8030178: wanted 24, got 0
Failure, exiting

pg_basebackup and pg_rewind fail to bring up the PostgreSQL service in the cluster after failover
