Incorrect resource manager data checksum in record at 2/XYZ + terminating walreceiver process due to administrator command

Time: 2016-03-02 16:07:48

Tags: postgresql replication wal

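I am running a streaming replication environment with PostgreSQL 9.1 (1 master, 3 slaves). Everything worked fine for approximately 2 months. Yesterday, replication to one of the slaves failed, and the log on that slave showed: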

LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
FATAL:  terminating walreceiver process due to administrator command
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7

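The slave was no longer in sync with the master. After two hours, during which the log gained a new line every 5 seconds, I restarted the slave database server: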

LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
LOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
FATAL:  terminating connection due to administrator command
FATAL:  terminating connection due to administrator command
LOG:  shutting down
LOG:  database system is shut down

The new log file on the slave contains:

LOG:  database system was shut down in recovery at 2016-02-29 05:12:11 CET
LOG:  entering standby mode
LOG:  redo starts at 61/D92C10C9
LOG:  consistent recovery state reached at  61/DA2710A7
LOG:  database system is ready to accept read only connections
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
LOG:  streaming replication successfully connected to primary

The slave is now in sync with the master, but the checksum entry is still there. I also checked the network logs: the network was available.

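My questions are:

  1. Does anyone know why the walreceiver was terminated?
  2. Why did PostgreSQL not retry the replication?
  3. What can I do to prevent this from happening in the future?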
Thanks.

Edit:

The database servers are running on SLES 11 with ext3. I found an article about poor performance on SLES 11 with large amounts of RAM, but I am not sure it applies, since my machines have only 8 GB of RAM (https://www.novell.com/support/kb/doc.php?id=7010287).

Any help would be appreciated.

Edit (2):

The PostgreSQL version is 9.1.5. It seems that PostgreSQL 9.1.6 provides a fix for a similar problem?

    Fix persistence marking of shared buffers during WAL replay (Jeff Davis)

    This mistake can result in buffers not being written out during checkpoints, resulting in data corruption if the server later crashes without ever having written those buffers. Corruption can occur on any server following crash recovery, but it is significantly more likely to occur on standby slave servers since those perform much more WAL replay.

Source: http://www.postgresql.org/docs/9.1/static/release-9-1-6.html

Could this be the fix? Should I upgrade to PostgreSQL 9.1.6, and will everything then run smoothly?

2 answers:

Answer 0: (score: 0)

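In case anyone runs into this problem: I ended up reinstalling the database from backup data and setting up replication again. Never really figured out what went wrong.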

Answer 1: (score: 0)

  

"Never really figured out what went wrong."

I ran into the same error, except that in my case the slave had never been fully in sync to begin with.

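Then the master server hit some kernel errors (a heat problem in the server chassis?). Because it could not complete a normal shutdown, the server had to be powered off. At that point, the slave logged: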

LOG:  incorrect resource manager data checksum in record at 1/63663CB0

After restarting the master and the slave, the situation did not change: the same log entry appears every 5 seconds.
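
Both reports share one symptom: the same record position repeats in the slave's log every few seconds. Below is a minimal Python sketch for spotting that pattern in a log file. It assumes the exact message wording shown in the logs above. The function names and the threshold of 3 are made up for illustration and are not part of PostgreSQL. It only detects the symptom; it does not explain or repair the underlying corruption.

```python
import re
from collections import Counter

# Pattern based on the log lines quoted in this thread; a different
# PostgreSQL version or message locale may need a different pattern.
CHECKSUM_RE = re.compile(
    r"incorrect resource manager data checksum in record at "
    r"([0-9A-Fa-f]+/[0-9A-Fa-f]+)"
)


def count_checksum_errors(lines):
    """Count checksum-error messages per WAL position (e.g. '61/DA2710A7')."""
    counts = Counter()
    for line in lines:
        match = CHECKSUM_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def stuck_positions(lines, threshold=3):
    """Hypothetical helper: WAL positions reported at least `threshold` times."""
    counts = count_checksum_errors(lines)
    return sorted(pos for pos, n in counts.items() if n >= threshold)


# Sample lines copied from the first log excerpt in the question.
sample = [
    "LOG:  incorrect resource manager data checksum in record at 61/DA2710A7",
    "FATAL:  terminating walreceiver process due to administrator command",
    "LOG:  incorrect resource manager data checksum in record at 61/DA2710A7",
    "LOG:  incorrect resource manager data checksum in record at 61/DA2710A7",
]

print(stuck_positions(sample))  # ['61/DA2710A7']
```

In practice the `lines` argument could be an open log file object, since iterating over a file yields its lines.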