Question

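I am running the RabbitMQ Docker image (rabbitmq:3-management) in AWS ECS. Everything works fine.

Then I added some complexity and created a service with the same RabbitMQ, but now attached to an AWS Network Load Balancer (my end goal is a RabbitMQ cluster, so I need several instances behind a load balancer). The target group is configured with port 5672 and uses the same port for health checks. The interval between health checks is 30 seconds (the maximum available). The threshold is 5. When configuring the service in ECS, I set the health check grace period to 120 seconds, which should be enough for the service to start. What happens is that a few minutes after I start the service, it is terminated and restarted: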

service Rabbit-master (instance i-xxx) (port 5672) is unhealthy in target-group Rabbit-cluster-target-group due to (reason Health checks failed)

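"A few minutes" means 2, or 5, or 9; it varies. It does not happen right at the start, but after a while. I can also see that RabbitMQ is working fine (in the logs and in the management panel). So it is the ELB that causes the restart. It is not the case that RabbitMQ dies first and is then restarted.

So my question is: what am I doing wrong, and how can I get RabbitMQ to run stably in ECS behind an ELB? Is it wrong to use port 5672 for the health check? If so, which port should I use? 15672?

Sorry if I have not provided enough detail. I included what seemed relevant to me. If you need anything else, I will gladly provide more. Thanks!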

Answer 1

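Apparently the problem was in how the security group of the RabbitMQ service was configured for the NLB's IP addresses. The idea did not occur to me immediately because:

the restart did not happen right after the service started, but a few minutes later
the NLB has no security group, and its ID is not very obvious.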

More details here:

https://forums.aws.amazon.com/thread.jspa?threadID=263245

and here:

https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-register-targets.html#target-security-groups
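The fix described above amounts to adding an inbound rule on the RabbitMQ service's security group that admits the load balancer's addresses on the health check port. A minimal sketch of building such a rule, in the shape the EC2 AuthorizeSecurityGroupIngress API accepts; the group ID and the IP addresses are hypothetical placeholders, and `buildNlbIngressParams` is an illustrative helper, not part of any AWS library:

```javascript
// Builds the parameter object for EC2 AuthorizeSecurityGroupIngress.
// groupId and nlbPrivateIps are placeholders; substitute your own values.
function buildNlbIngressParams(groupId, nlbPrivateIps, port) {
  return {
    GroupId: groupId,
    IpPermissions: [{
      IpProtocol: 'tcp',
      FromPort: port,
      ToPort: port,
      // One /32 range per load balancer address.
      IpRanges: nlbPrivateIps.map((ip) => ({
        CidrIp: `${ip}/32`,
        Description: 'NLB traffic and health checks',
      })),
    }],
  };
}

// Hypothetical security group ID and NLB private IPs.
const params = buildNlbIngressParams(
  'sg-0123456789abcdef0',
  ['10.0.1.15', '10.0.2.27'],
  5672
);
console.log(JSON.stringify(params, null, 2));
```

The resulting object could be passed to the AWS SDK's authorize-ingress call or mirrored by hand in the console. Allowing the whole VPC CIDR instead of individual addresses is a coarser alternative that the linked documentation also describes.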

Answer 2

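Is your health check URL valid? This happened to me with an ALB. My case was as follows.

For example: the target group was mapped to /api/profiles => container:4000, but my container had no route serving /api/profiles, because the ALB does not rewrite paths the way my previous Nginx did. It was looking for the route /api/profiles in the container, while my route was just /profiles. So I changed the path in nginx, and then it worked.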
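To make the mismatch concrete, here is a small sketch under stated assumptions: `statusFor` stands in for a container that only serves /profiles, and `stripApiPrefix` stands in for the rewrite a fronting proxy would have done. Both names are hypothetical.

```javascript
// Hypothetical container app: the only route it knows is /profiles.
function statusFor(path) {
  return path === '/profiles' ? 200 : 404;
}

// Hypothetical stand-in for a proxy rule that removes the /api prefix.
function stripApiPrefix(path) {
  return path.startsWith('/api/') ? path.slice('/api'.length) : path;
}

// The load balancer forwards the path unchanged, so the check fails.
console.log(statusFor('/api/profiles')); // 404

// With a rewrite in front (or a matching route in the app), it passes.
console.log(statusFor(stripApiPrefix('/api/profiles'))); // 200
```

Either the application must serve the exact path the health check requests, or the health check path must be changed to one the application already serves.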

How to diagnose

Enable CloudWatch logs; hopefully you will see the real problem there.
If not, go through the whole list at https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-troubleshooting.html
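As an additional diagnostic, a TCP health check can be approximated by simply trying to open a connection to the target port. The sketch below assumes the target group uses a TCP check; `probeTcp` is a hypothetical helper, and the address in the usage comment is a placeholder.

```javascript
const net = require('net');

// Resolves to true if a TCP connection to host:port opens within
// timeoutMs, and to false on refusal, error, or timeout.
function probeTcp(host, port, timeoutMs) {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    const finish = (ok) => {
      socket.destroy();
      resolve(ok);
    };
    socket.setTimeout(timeoutMs);
    socket.once('connect', () => finish(true));
    socket.once('timeout', () => finish(false));
    socket.once('error', () => finish(false));
  });
}

// Usage (placeholder address):
// probeTcp('10.0.1.25', 5672, 2000).then((open) => console.log('open:', open));
```

Security group rules depend on the source of the traffic, so a successful probe from one host does not prove that the load balancer's own addresses are allowed. Run it from a host whose access matches what you want to test.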

Answer 3

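When attaching a service to an ALB, it is important to specify the correct health check path or port.

The ALB does not inspect the response body, only the status code. The only call that returns a 200 status code is curl -I http://127.0.0.1:15672; other calls require authentication or return 404 or 403, and the LB then marks the target as unhealthy.

So use port 15672 for the health check, because it returns 200.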
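The status-code rule can be illustrated with a small sketch. ALB target groups accept success codes as a single value, a comma-separated list, or a range (for example "200", "200,202", or "200-299"). The function below is an illustration of that matching logic, not AWS's implementation, and its name is hypothetical.

```javascript
// Returns true if statusCode satisfies a success-code matcher such as
// "200", "200,202", or "200-299".
function matchesSuccessCodes(statusCode, matcher) {
  return matcher.split(',').some((part) => {
    // A single value like "200" is treated as the range 200-200.
    const [low, high = low] = part.trim().split('-').map(Number);
    return statusCode >= low && statusCode <= high;
  });
}

console.log(matchesSuccessCodes(200, '200')); // true
console.log(matchesSuccessCodes(401, '200')); // false
console.log(matchesSuccessCodes(404, '200-299')); // false
```

With a matcher of "200", an endpoint that answers 401 because it wants credentials counts as a failed check even though the broker is up.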

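Also, verify that the health check of the target group for the ECS task points to the correct port on the instance.

Second option: you can write a custom health check for the LB that monitors both ports of the container. An ALB health check only checks one port at a time, so you run a simple Node.js application that checks both ports itself and answers the ALB health check.

In that case, your health check path will be /ping and the port will be 3007.

Below is the code we use for this kind of ECS task, where multiple ports need to be checked.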

const express = require('express');
const isAllReachable = require('is-all-reachable');
const request = require('request');

const app = express();

// The load balancer calls GET /ping on port 3007 and only looks at the status code.
app.get('/ping', (req, res) => {
    isAllReachable([
        // First check that every listed endpoint is reachable.
        'http://localhost:15672'
        // 'http://localhost:otherport'
    ], (err, reachable, host) => {
        if (!reachable) {
            console.log('Health check failed: this server is not reachable', host, err);
            return res.status(500).send('health check failed. one of the ports is not reachable.');
        }

        // The port is open; now confirm that the RabbitMQ management API responds.
        console.log('Ports reachable, checking RabbitMQ');
        request.get('http://localhost:15672/api/vhosts', {
            auth: {
                user: 'guest',
                pass: 'guest',
                sendImmediately: false
            }
        }, (error, response, body) => {
            // On a transport error `response` is undefined, so check `error` first.
            if (error || response.statusCode !== 200) {
                console.log('failed to get vhosts', error || response.statusCode);
                return res.status(500).send('health check failed');
            }
            console.log({ status_code: response.statusCode, body: body });
            res.status(200).send('rabbit mq is running');
        });
    });
});

app.listen(3007, () => console.log('LB custom Health check server listening on port 3007!'));

For RabbitMQ monitoring, you can look further into the topic of rabbitmq monitoring.

AWS ELB kills the RabbitMQ service in AWS ECS within minutes due to failed health checks
