Best way to implement AWS ECS health checks

Date: 2018-05-15 16:25:06

Tags: amazon-web-services amazon-ec2 amazon-cloudwatch amazon-ecs

I'm implementing health checks for ECS and I'm trying to work out the best approach.

So far I've found a few options:

  1. Use the AWS ECS metrics and dimensions and check whether a given metric has an insufficient value.
  2. Use a CloudWatch alarm:

  ECSHealthAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Alarm for ECS StatusCheckFailed Metric
      ComparisonOperator: GreaterThanOrEqualToThreshold
      EvaluationPeriods: 2
      Statistic: Maximum
      MetricName: StatusCheckFailed
      Namespace: AWS/ECS
      Period: 30
      Threshold: 1.0
      AlarmActions:
      - !Ref AlarmTopic
      InsufficientDataActions:
      - !Ref AlarmTopic
      Dimensions:
      - Name: ClusterName
        Value: !Ref ClusterName
      - Name: ServiceName
        Value: !GetAtt service.Name
    
  3. Use CloudWatch Events:

    EventRule:
      Type: "AWS::Events::Rule"
      Properties:
        Name: CloudWatchRMExtensionECSStoppedRule
        Description: "Notify when ECS container stopped"
        EventPattern:
          source: ["aws.ecs"]
          detail-type: ["ECS Task State Change", "ECS Container Instance State Change"]
          detail:
            clusterArn: [ 'clusterArn' ]
            lastStatus: [ "STOPPED" ]
            stoppedReason: [ "Essential container in task exited" ]
            group: [ 'service-group' ]
        State: "ENABLED"
        Targets:
          - Arn: !Ref ECSAlarmSNSTopic
            Id: "PublishAlarmTopic"
            InputTransformer:
              InputPathsMap:
                stopped-reason: "$.detail.stoppedReason"
              InputTemplate: '"This micro-service has been stopped with the following reason: <stopped-reason>"'
      

Could you tell me whether these variants are correct, or whether there is a more efficient approach? Thanks for your help!

1 answer:

Answer 0 (score: 0)

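I can't comment, so here are some thoughts. It's not clear to me whether you want alarms from EC2 instance-level status checks or at the level of each ECS service task, so I'm listing all the possible options here.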

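  1. I would run the ECS cluster's EC2 instances in an Auto Scaling group and, based on the ASG CloudWatch metrics, set up SNS notifications for when instances are added or removed.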

https://docs.aws.amazon.com/autoscaling/ec2/userguide/healthcheck.html
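
As a rough sketch of that first option, an Auto Scaling group can publish launch and terminate notifications straight to an SNS topic. Note that this uses the group's notification configuration rather than CloudWatch metrics. The resource names `EcsAutoScalingGroup`, `EcsLaunchTemplate`, `AsgNotificationTopic` and the `SubnetIds` parameter are made up for illustration and would need to match your own template:

```yaml
# Sketch only: EcsAutoScalingGroup, EcsLaunchTemplate, SubnetIds and
# AsgNotificationTopic are hypothetical names, not from the original post.
Resources:
  AsgNotificationTopic:
    Type: AWS::SNS::Topic

  EcsAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "1"
      MaxSize: "3"
      # Assumed parameter of type List<AWS::EC2::Subnet::Id>
      VPCZoneIdentifier: !Ref SubnetIds
      LaunchTemplate:
        # Assumed to be defined elsewhere in the template
        LaunchTemplateId: !Ref EcsLaunchTemplate
        Version: !GetAtt EcsLaunchTemplate.LatestVersionNumber
      # Replace instances that fail the EC2 status checks
      HealthCheckType: EC2
      HealthCheckGracePeriod: 300
      # Send an SNS message whenever an instance is added or removed
      NotificationConfigurations:
        - TopicARN: !Ref AsgNotificationTopic
          NotificationTypes:
            - autoscaling:EC2_INSTANCE_LAUNCH
            - autoscaling:EC2_INSTANCE_LAUNCH_ERROR
            - autoscaling:EC2_INSTANCE_TERMINATE
            - autoscaling:EC2_INSTANCE_TERMINATE_ERROR
```

This covers the instance level only; it says nothing about whether individual service tasks are healthy.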

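  2. We can also ship the AWS ecs-agent Docker container logs to CloudWatch and get SNS notifications based on errors or filtered events.

  3. We can also subscribe to the ECS event stream in CloudWatch to be notified whenever a service task starts or stops. References: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cloudwatch_event_stream.html https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_cwet.html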

Sample event entries are at this link: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_cwe_events.html
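
For the event stream option, something along these lines might work as a starting point. It matches every stopped task in one cluster, without filtering on a specific `stoppedReason`, and adds a topic policy so that the rule is allowed to publish to SNS. `TaskStoppedTopic`, `TaskStoppedRule` and the `ClusterArn` parameter are hypothetical names:

```yaml
# Sketch only: TaskStoppedTopic, TaskStoppedRule and the ClusterArn
# parameter are hypothetical names for illustration.
Resources:
  TaskStoppedTopic:
    Type: AWS::SNS::Topic

  TaskStoppedRule:
    Type: AWS::Events::Rule
    Properties:
      Description: Notify when any task in the cluster stops
      State: ENABLED
      EventPattern:
        source: ["aws.ecs"]
        detail-type: ["ECS Task State Change"]
        detail:
          clusterArn:
            - !Ref ClusterArn
          lastStatus: ["STOPPED"]
      Targets:
        - Arn: !Ref TaskStoppedTopic
          Id: PublishTaskStopped

  # Without a policy like this, the rule is not permitted to publish to the topic
  TaskStoppedTopicPolicy:
    Type: AWS::SNS::TopicPolicy
    Properties:
      Topics:
        - !Ref TaskStoppedTopic
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: events.amazonaws.com
            Action: sns:Publish
            Resource: !Ref TaskStoppedTopic
```

Matching on `lastStatus` alone also fires for tasks stopped by normal deployments, so you may want to narrow the pattern once you know which stop reasons matter to you.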

For reference on setting up alarms based on log events:

https://medium.com/@martatatiana/insufficient-data-cloudwatch-alarm-based-on-custom-metric-filter-4e41c1f82050
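
To illustrate the log-based option, here is a minimal sketch of a metric filter plus an alarm on the resulting custom metric. The log group `EcsAgentLogGroup`, the `ERROR` filter pattern and the `Custom/ECS` namespace are assumptions; `AlarmTopic` is the SNS topic already referenced in the question:

```yaml
# Sketch only: EcsAgentLogGroup, the filter pattern and the metric
# names are assumptions for illustration.
Resources:
  AgentErrorMetricFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      # Assumed log group that receives the ecs-agent container logs
      LogGroupName: !Ref EcsAgentLogGroup
      # Hypothetical pattern; adapt it to the actual log format
      FilterPattern: "ERROR"
      MetricTransformations:
        - MetricNamespace: Custom/ECS
          MetricName: EcsAgentErrorCount
          MetricValue: "1"

  AgentErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: ecs-agent logged at least one error in the last 5 minutes
      Namespace: Custom/ECS
      MetricName: EcsAgentErrorCount
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 1
      ComparisonOperator: GreaterThanOrEqualToThreshold
      # The filter emits no data points while nothing matches,
      # so treat missing data as healthy instead of INSUFFICIENT_DATA
      TreatMissingData: notBreaching
      AlarmActions:
        - !Ref AlarmTopic
```

The `TreatMissingData` setting is the part that addresses the insufficient-data state discussed in the linked article.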

Also, add a health check to each ECS service so that a container is restarted when it becomes unhealthy. https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html#container_definition_healthcheck
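
A container-level health check in the task definition could look roughly like this. The family, image, port and `/health` path are placeholders, and the command assumes `curl` is installed in the image:

```yaml
# Sketch only: the family, image, port and /health path are hypothetical.
Resources:
  ServiceTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: my-service
      ContainerDefinitions:
        - Name: app
          Image: my-registry/my-service:latest
          Memory: 256
          Essential: true
          HealthCheck:
            # Runs inside the container; a non-zero exit code counts as a failure
            Command:
              - CMD-SHELL
              - curl -f http://localhost:8080/health || exit 1
            Interval: 30      # seconds between checks
            Timeout: 5        # seconds before a single check is considered failed
            Retries: 3        # consecutive failures before the container is unhealthy
            StartPeriod: 60   # grace period for slow-starting containers
```

When an essential container in a service task is reported unhealthy, the service scheduler stops that task and starts a replacement.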

Let me know your thoughts as well :).