keepalived scripts make failover go crazy

Time: 2015-03-08 19:38:56

Tags: python python-2.7

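What happens? When I start keepalived, everything works. When node01 fails and can no longer start postgresql, it keeps trying to force an election even though postgresql cannot be started. The election now happens every second.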

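What I want to achieve: while node02 is master, keepalived should check whether postgresql can be started on node01, but it should not keep forcing elections. Can someone help me get this right?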

Here is my code.

Stop pgsql:

#!/usr/bin/python

import sys
import subprocess

sys.exit(
    subprocess.call(['/usr/bin/systemctl', 'stop', 'postgresql.service'])
)

Notify:

#!/usr/bin/python

import sys
import subprocess

state = sys.argv[3]

with open('/var/run/keepalived.pgsql.state', 'w+') as f:
    f.write(state)

if state == 'MASTER':
    sys.exit(
        subprocess.call(['/usr/bin/systemctl', 'start', 'postgresql.service'])
    )

if state == 'BACKUP':
    sys.exit(
        subprocess.call(['/usr/bin/systemctl', 'stop', 'postgresql.service'])
    )

if state == 'FAULT':
    sys.exit(
        subprocess.call(['/usr/bin/systemctl', 'stop', 'postgresql.service'])
    )

Check pgsql:

#!/usr/bin/python

import sys
import subprocess
from time import sleep

sleep(1)

with open('/var/run/keepalived.pgsql.state', 'r') as f:
    state = f.read().strip().strip("\n")

# status 0: Postgresql is running
# status 3: Postgresql has been stopped
status = subprocess.call(['/usr/bin/systemctl', 'status', 'postgresql.service'])

if status == 0 and state == 'MASTER':
    sys.exit(0)

if status == 0 and state == 'BACKUP':
    sys.exit(3)

if status == 3 and state == 'MASTER':
    sys.exit(3)

if status == 3 and state == 'BACKUP':
    sys.exit(0)

keepalived config:

vrrp_script chk_pgsql {
  script       "/etc/keepalived/check-pgsql"
  interval 1
  fall 3
  rise 3
  weight -4
}

vrrp_instance pgsql_vip {
    state EQUAL
    interface eth0
    virtual_router_id 4
    priority 100    # 100 on node01, 99 on node02
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    track_script {
        chk_pgsql
    }
    virtual_ipaddress {
        192.168.1.20
    }
    notify "/etc/keepalived/notify"
    notify_stop "/etc/keepalived/stop"
}

1 answer:

Answer 0 (score: 0)

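After node01 dies, node02 is elected master. The check script then runs on node01. It sees that node01 is now in the BACKUP state with postgresql stopped, so it returns 0. After the check script has returned 0 three times (the "rise 3" in your VRRP configuration), node01 considers itself healthy. Because node01 has a higher priority than node02, it then takes control through the election process. At that point the check script fails, because node01 is in the MASTER state and postgresql is stopped. This makes keepalived flap between the nodes.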
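To make that sequence concrete, here is a small sketch that reproduces only the decision table of the question's check script. The function name check_exit is made up for illustration, and nothing here talks to systemctl or keepalived; it just shows which exit code the script produces for a given systemctl status and keepalived state.

```python
def check_exit(status, state):
    """Exit code of the question's check script.

    status: exit code of `systemctl status postgresql.service`
            (0 = running, 3 = stopped, as in the question's comments)
    state:  contents of /var/run/keepalived.pgsql.state
    """
    if status == 0 and state == 'MASTER':
        return 0
    if status == 0 and state == 'BACKUP':
        return 3
    if status == 3 and state == 'MASTER':
        return 3
    if status == 3 and state == 'BACKUP':
        return 0
    # No branch matched (e.g. state FAULT): the original script simply
    # runs off the end, which means exit code 0.
    return 0


# node01 with postgresql unable to start (status 3):
print(check_exit(3, 'BACKUP'))  # 0 -> check passes, node01 looks healthy
print(check_exit(3, 'MASTER'))  # 3 -> check fails as soon as node01 is promoted
```

So a node whose postgresql is broken passes the check while it is BACKUP and fails it as soon as it becomes MASTER, which is the loop described above.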

I think you can fix this in one of two ways:

  1. Give node01 and node02 the same priority
  2. Change your check script so that it only returns the status of postgresql
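For option 2, a minimal sketch of a check that ignores the keepalived state file and reports only whether postgresql is running might look like the following. The function name postgresql_exit_code and the injectable call parameter are assumptions added for illustration; the systemctl command is the one from the question. How this interacts with the notify script, which stops postgresql on BACKUP, is not covered here and would need to be checked separately.

```python
import subprocess


def postgresql_exit_code(call=subprocess.call):
    """Return 0 if systemctl reports postgresql.service as running, else 1.

    `call` is a hypothetical hook so the logic can be exercised without
    systemctl; by default it runs the same command as the question's script.
    """
    try:
        status = call(['/usr/bin/systemctl', 'status', 'postgresql.service'])
    except OSError:
        # systemctl is missing or cannot be executed: treat as a failed check.
        return 1
    return 0 if status == 0 else 1


# In the actual check script the last line would be:
#     sys.exit(postgresql_exit_code())
```

With this version the exit code no longer depends on whether the node is MASTER or BACKUP, so a node with a broken postgresql never reports itself healthy.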