Custom Nagios script run through NRPE fails with non-zero exit status 1

Time: 2015-07-24 21:34:36

Tags: python rabbitmq nagios nrpe rabbitmqctl

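I am trying to run a Python script through NRPE to monitor RabbitMQ. The script runs the command 'sudo rabbitmqctl list_queues', which gives me the message count for each queue. However, when the check runs under Nagios I get this message: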

CRITICAL - Command '['sudo', 'rabbitmqctl', 'list_queues']' returned non-zero exit status 1 

I thought this might be a permissions problem, so I set things up as follows.

/etc/group:

ec2-user:x:500:
rabbitmq:x:498:nrpe,nagios,ec2-user
nagios:x:497:
nrpe:x:496:
rpc:x:32:

/etc/sudoers:

%rabbitmq ALL=NOPASSWD: /usr/sbin/rabbitmqctl

nagios configuration:

command[check_rabbitmq_queuecount_prod]=/usr/bin/python27 /etc/nagios/check_rabbitmq_prod -a queues_count -C 3000 -W 1500

check_rabbitmq_prod:



#!/usr/bin/env python
from optparse import OptionParser
import shlex
import subprocess
import sys

class RabbitCmdWrapper(object):
    """So basically this just runs rabbitmqctl commands and returns parsed output.
       Typically this means you need root privs for this to work.

       Made this its own class so it could be used in other monitoring tools
       if desired."""

    @classmethod
    def list_queues(cls):
        args = shlex.split('sudo rabbitmqctl list_queues')
        cmd_result = subprocess.check_output(args).strip()
        results = cls._parse_list_results(cmd_result)
        return results

    @classmethod
    def _parse_list_results(cls, result_string):
        results = result_string.strip().split('\n')
        #remove text fluff
        results.remove(results[-1])
        results.remove(results[0])
        return_data = []
        for row in results:
            return_data.append(row.split('\t'))
        return return_data

def check_queues_count(critical=1000, warning=1000):
    """
    A blanket check to make sure all queues are within count parameters.
    TODO: Possibly break this out so test can be done on individual queues.
    """
    try:
        critical_q = []
        warning_q = []
        ok_q = []
        results = RabbitCmdWrapper.list_queues()

        for queue in results:
            if queue[0] == 'SFS_Production_Queue':
                count = int(queue[1])
                if count >= critical:
                    critical_q.append("%s: %s" % (queue[0], count))
                elif count >= warning:
                    warning_q.append("%s: %s" % (queue[0], count))
                else:
                    ok_q.append("%s: %s" % (queue[0], count))
        if critical_q:
            print "CRITICAL - %s" % ", ".join(critical_q)
            sys.exit(2)
        elif warning_q:
            print "WARNING - %s" % ", ".join(warning_q)
            sys.exit(1)
        else:
            print "OK - %s" % ", ".join(ok_q)
            sys.exit(0)
    except Exception, err:
        print "CRITICAL - %s" % err
        sys.exit(2)

USAGE = """Usage: ./check_rabbitmq -a [action] -C [critical] -W [warning]
           Actions:
           - queues_count
             checks the count in each of the queues in rabbitmq's list_queues"""

if __name__ == "__main__":
    parser = OptionParser(USAGE)
    parser.add_option("-a", "--action", dest="action",
                      help="Action to Check")
    parser.add_option("-C", "--critical", dest="critical",
                      type="int", help="Critical Threshold")
    parser.add_option("-W", "--warning", dest="warning",
                      type="int", help="Warning Threshold")
    (options, args) = parser.parse_args()

    if options.action == "queues_count":
        check_queues_count(options.critical, options.warning)
    else:
        print "Invalid action: %s" % options.action
        print USAGE
        # Exit with the Nagios UNKNOWN status so a bad action is not reported as OK.
        sys.exit(3)




At this point I am not sure what is preventing the script from running. It works fine from the command line. Any help is appreciated.

2 answers:

Answer 0: (score: 3)

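A "non-zero exit status" error like this one is usually related to requiretty, which your sudoers file applies to all users by default. NRPE runs the check without a terminal, so sudo refuses to run the command even though the same script works from an interactive shell.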

It is safe to disable requiretty in the sudoers file for the user that runs the check, and doing so will probably resolve the problem.

E.g. (assuming nagios / nrpe are the users)

In /etc/sudoers:

Defaults:nagios !requiretty
Defaults:nrpe !requiretty
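
One way to check whether sudo itself is rejecting the call is to have the plugin capture stderr along with stdout, so the text behind "non-zero exit status 1" reaches the Nagios output. The following is only a minimal sketch, not the original plugin: the helper name run_and_capture and the FAKE_SUDO_FAILURE stand-in command are assumptions made for illustration, and the stand-in uses the Python interpreter to imitate a failing sudo call so the example runs without sudo or RabbitMQ. It is written to run on Python 3 and should also run on Python 2.7.

```python
import subprocess
import sys


def run_and_capture(args):
    """Run a command and return (exit_status, combined stdout/stderr text)."""
    try:
        # stderr=subprocess.STDOUT folds error text into the captured output,
        # so a complaint printed by sudo is not lost when the command fails.
        output = subprocess.check_output(args, stderr=subprocess.STDOUT)
        status = 0
    except subprocess.CalledProcessError as err:
        # CalledProcessError keeps whatever the command printed in .output.
        output = err.output
        status = err.returncode
    return status, output.decode("utf-8", "replace").strip()


# Hypothetical stand-in for ['sudo', 'rabbitmqctl', 'list_queues']:
# a child process that writes a message to stderr and exits with status 1.
FAKE_SUDO_FAILURE = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('sorry, you must have a tty to run sudo\\n'); sys.exit(1)",
]

status, text = run_and_capture(FAKE_SUDO_FAILURE)
if status != 0:
    # Nagios shows the first line of plugin output, so include the real cause.
    print("CRITICAL - exit status %d: %s" % (status, text))
```

In the real plugin the same idea would apply to the list_queues call: pass stderr=subprocess.STDOUT and report the output attribute of the caught CalledProcessError instead of only the exception message.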

Answer 1: (score: 1)

I think what @EE1213 mentioned is correct. If you have permission to read /var/log/secure, the log may contain an error message about sudoers, such as:

"sorry, you must have a tty to run sudo"
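
If the log is large, a short script can pick out the relevant lines. This is a rough sketch under stated assumptions: the function name find_tty_errors is hypothetical, and SAMPLE_LINES contains made-up placeholder lines rather than real log output, used only so the example runs without access to /var/log/secure.

```python
# The message quoted above, as sudo reports it when requiretty blocks a call.
TTY_ERROR = "sorry, you must have a tty to run sudo"


def find_tty_errors(lines):
    """Return the lines (without trailing newlines) that contain TTY_ERROR."""
    return [line.rstrip("\n") for line in lines if TTY_ERROR in line]


# Made-up placeholder lines for illustration only (not real log output).
SAMPLE_LINES = [
    "placeholder line 1: unrelated entry\n",
    "placeholder line 2: sorry, you must have a tty to run sudo\n",
    "placeholder line 3: another unrelated entry\n",
]

for match in find_tty_errors(SAMPLE_LINES):
    print(match)

# On the monitored host (reading the file usually requires root):
#     with open("/var/log/secure") as log_file:
#         matches = find_tty_errors(log_file)
```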