Question

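I am trying to deploy an automated, highly available PostgreSQL cluster on Kubernetes. When the primary fails over or fails temporarily, the standby loses its streaming replication connection, and when it retries, the attempt takes a long time to fail before the next retry.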

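I am using PostgreSQL 10 with streaming replication (cluster-main-cluster-master-service is a service that always routes to the primary, and all replicas connect to it for replication). I tried setting connect_timeout and keepalive options in primary_conninfo in recovery.conf, and wal_receiver_timeout in postgresql.conf on the standby, but none of them made any difference.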
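For reference, a configuration of the kind described above might look like the sketch below. Only the service host name, port 5432, and connect_timeout=10 come from the post itself. The replication user name, the keepalive values, and the wal_receiver_timeout value are hypothetical placeholders, not the asker's actual settings.

```conf
# recovery.conf on the standby (PostgreSQL 10)
standby_mode = 'on'
# user and keepalive values below are hypothetical; connect_timeout=10 is from the post
primary_conninfo = 'host=cluster-main-cluster-master-service port=5432 user=replicator connect_timeout=10 keepalives=1 keepalives_idle=5 keepalives_interval=2 keepalives_count=3'

# postgresql.conf on the standby
# hypothetical value; the post does not state what was used
wal_receiver_timeout = 10s
```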

First, when the primary goes down, replication stops with the following error (state 1):

2019-10-06 14:14:54.042 +0330 [3039] LOG: replication terminated by primary server
2019-10-06 14:14:54.042 +0330 [3039] DETAIL: End of WAL reached on timeline 17 at 0/33000098.
2019-10-06 14:14:54.042 +0330 [3039] FATAL: could not send end-of-streaming message to primary: no COPY in progress
2019-10-06 14:14:55.534 +0330 [12] LOG: record with incorrect prev-link 0/2D000028 at 0/33000098

After looking at the Postgres activity, I found that the WalReceiver process is stuck in the LibPQWalReceiverConnect wait_event (state 2), but the timeout is far longer than what I configured (I set connect_timeout to 10 seconds, yet it takes about 2 minutes). It then fails with the following error (state 3):

2019-10-06 14:17:06.035 +0330 [3264] FATAL:  could not connect to the primary server: could not connect to server: Connection timed out
        Is the server running on host "cluster-main-cluster-master-service" (192.168.0.166) and accepting
        TCP/IP connections on port 5432?

On the next attempt, it connects to the primary successfully (state 4):

2019-10-06 14:17:07.892 +0330 [5786] LOG: started streaming WAL from primary at 0/33000000 on timeline 17

I also tried killing the process while it was stuck (state 2). When I do, the process is started again, connects, and streams normally (jumping to state 4).

After checking netstat, I also found that, in the failover case, the walreceiver process has a connection to the old primary in the SYN_SENT state.

Answer 1

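connect_timeout controls how long PostgreSQL waits for the replication connection to succeed, but that does not include establishing the TCP connection.

To reduce the time the kernel waits for a successful response to a TCP SYN request, reduce the number of retries. In /etc/sysctl.conf, set: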

net.ipv4.tcp_syn_retries = 3

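and run sysctl -p.

That should reduce the time considerably.

Lowering the value too far may make your system less stable, because fewer SYN retransmissions are attempted before a connection attempt is abandoned.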
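As a rough check of what this setting changes, the sketch below estimates how long a connection attempt hangs before the kernel gives up. It assumes Linux behaviour: an initial SYN retransmission timeout of 1 second that doubles after every attempt, and a default tcp_syn_retries of 6. The function name is made up for illustration, and the cap on very large timeouts is ignored.

```python
def syn_connect_timeout(syn_retries, initial_rto=1.0):
    """Approximate seconds before connect() fails with "Connection timed out".

    Assumes the kernel sends one SYN, then retransmits it `syn_retries`
    times, doubling the wait after each transmission (1 s, 2 s, 4 s, ...).
    """
    # One wait per transmitted SYN: the original plus each retransmission.
    return sum(initial_rto * 2 ** attempt for attempt in range(syn_retries + 1))


if __name__ == "__main__":
    for retries in (6, 3):
        print(f"tcp_syn_retries={retries}: ~{syn_connect_timeout(retries):.0f} s")
```

Under these assumptions, 6 retries give about 127 seconds, which is in line with the roughly 2 minutes reported in the question, and 3 retries give about 15 seconds.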

PostgreSQL WalReceiver process waits on connecting to the primary server regardless of "connect_timeout"
