我设置了一个分布式postgres数据库系统,并使用hot_standby wal_level配置了复制。
有一个中央主数据库,其中有多个副本(目前全世界有15个)用作只读实例 - 因此不需要故障转移 - 我们只想将数据同步到我们可以读取的远程位置它们。
主:
wal_level = hot_standby
max_wal_senders = 20
checkpoint_segments = 8
wal_keep_segments = 8
客户端:
wal_level = hot_standby
max_wal_senders = 3
checkpoint_segments = 8
wal_keep_segments = 8
hot_standby = on
客户端的/var/lib/postgresql/9.4/recovery.conf:
standby_mode = 'on'
primary_conninfo = 'host=<IP of master> port=5432 user=replicator password=xxxx sslmode=require'
trigger_file = '/tmp/postgresql.trigger'
复制开始 - 有些日子似乎都很好。 几天之后,主人似乎没有接受更多连接进行复制......
客户:
2017-05-04 01:16:51 UTC [9608-1] FATAL: could not connect to the primary server: FATAL: sorry, too many clients already
2017-05-04 01:16:57 UTC [10807-1] FATAL: could not connect to the primary server: FATAL: sorry, too many clients already
2017-05-04 01:17:02 UTC [12022-1] FATAL: could not connect to the primary server: FATAL: sorry, too many clients already
2017-05-04 01:17:06 UTC [13217-1] FATAL: could not connect to the primary server: FATAL: remaining connection slots are reserved for non-replication superuser connections
...
主:
然后loag充满了下面的消息 - 它永远不会恢复......
2017-05-04 08:44:14 UTC [24850-1] replicator@[unknown] ERROR: requested WAL segment 000000010000003500000014 has already been removed
2017-05-04 08:44:19 UTC [25958-1] replicator@[unknown] ERROR: requested WAL segment 000000010000003500000014 has already been removed
2017-05-04 08:44:24 UTC [27063-1] replicator@[unknown] ERROR: requested WAL segment 000000010000003500000014 has already been removed
2017-05-04 08:44:29 UTC [28144-1] replicator@[unknown] ERROR: requested WAL segment 000000010000003500000014 has already been removed
2017-05-04 08:44:34 UTC [29227-1] replicator@[unknown] ERROR: requested WAL segment 000000010000003500000014 has already been removed
2017-05-04 08:44:39 UTC [30316-1] replicator@[unknown] ERROR: requested WAL segment 000000010000003500000014 has already been removed
...
客户端:
2017-04-30 11:26:22 UTC [28474-1] LOG: started streaming WAL from primary at 35/14000000 on timeline 1
2017-04-30 11:26:22 UTC [28474-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000003500000014 has already been removed
2017-04-30 11:26:26 UTC [29328-1] LOG: started streaming WAL from primary at 35/14000000 on timeline 1
2017-04-30 11:26:26 UTC [29328-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000003500000014 has already been removed
2017-04-30 11:26:31 UTC [30394-1] LOG: started streaming WAL from primary at 35/14000000 on timeline 1
2017-04-30 11:26:31 UTC [30394-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000003500000014 has already been removed
...
所以我的问题:
答案 0 :(得分:1)
https://www.postgresql.org/docs/current/static/runtime-config-replication.html:
max_wal_senders(整数)
指定备用数据库的最大并发连接数 服务器或流基备份客户端(即最大数量 同时运行WAL发送器进程)。默认值为零, 意味着禁用复制。 WAL发送者进程计入 连接总数,因此参数不能设置得更高 比max_connections。 突发的流媒体客户端可能会断开连接 导致孤立的连接槽,直到达到超时,所以这个 参数应设置略高于最大数量 预期客户所以断开连接的客户可以立即重新连接。
(强调我的)。 应用程序连接或孤立连接导致您的
FATAL: sorry, too many clients already
您可能希望为应用程序使用某些连接池,例如pgbouncer,在实际发生之前限制太多连接。
如果您have archive_command
设置为在某处实际复制WAL,请回答您的问题,修改slave上recovery.conf中的restore_command以获取它们。它将允许奴隶从失去的那一刻开始追赶。否则你必须重建它。