AWS CloudWatch metric alarm not triggered after the first time

Date: 2019-09-10 17:12:38

Tags: amazon-web-services amazon-cloudformation amazon-cloudwatch amazon-cloudwatchlogs

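I have an alarm that looks for error messages in the logs, and it does go into the alarm state. However, it never resets and stays In Alarm. The alarm action is an SNS topic, which in turn sends an email. So after the first error, I do not receive any subsequent emails. What is wrong with the following template configuration?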

"AppErrorMetric": {
  "Type": "AWS::Logs::MetricFilter",
  "Properties": {
    "LogGroupName": {
      "Ref": "AppServerLG"
    },
    "FilterPattern": "[error]",
    "MetricTransformations": [
      {
        "MetricValue": "1",
        "MetricNamespace": {
          "Fn::Join": [
            "",
            [
              {
                "Ref": "ApplicationEndpoint"
              },
              "/metrics/AppError"
            ]
          ]
        },
        "MetricName": "AppError"
      }
    ]
  }
},
"AppErrorAlarm": {
        "Type": "AWS::CloudWatch::Alarm",
        "Properties": {
            "ActionsEnabled": "true",
            "AlarmName": {
                "Fn::Join": [
                    "",
                    [
                        {
                            "Ref": "AppId"
                        },
                        ",",
                        {
                            "Ref": "AppServerAG"
                        },
                        ":",
                        "AppError",
                        ",",
                        "MINOR"
                    ]
                ]
            },
            "AlarmDescription": {
                "Fn::Join": [
                    "",
                    [
                        "service is throwing error. Please check logs.",
                        {
                            "Ref": "AppServerAG"
                        },
                        "-",
                        {
                            "Ref": "AppId"
                        }
                    ]
                ]
            },
            "MetricName": "AppError",
            "Namespace": {
                "Fn::Join": [
                    "",
                    [
                        {
                            "Ref": "ApplicationEndpoint"
                        },
                        "metrics/AppError"
                    ]
                ]
            },
            "Statistic": "Sum",
            "Period": "300",
            "EvaluationPeriods": "1",
            "Threshold": "1",
            "AlarmActions": [{
              "Fn::GetAtt": [
                "VPCInfo",
                "SNSTopic"
              ]
            }],
            "ComparisonOperator": "GreaterThanOrEqualToThreshold"
        }
}

1 Answer:

Answer 0 (score: 1)

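Your problem is a combination of two factors:

  1. Your metric is emitted only when an error is found. It is a sparse metric: it reports 1 when there is an error, but no zero is emitted when there are no errors.
  2. By default, CloudWatch alarms are configured with TreatMissingData set to missing.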

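The CloudWatch documentation about missing data says:

For each alarm, you can specify that CloudWatch treat missing data points as any of the following:

  • notBreaching – Missing data points are treated as "good" and within the threshold
  • breaching – Missing data points are treated as "bad" and breaching the threshold
  • ignore – The current alarm state is maintained
  • missing – The alarm does not consider missing data points when evaluating whether to change state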

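Adding the "TreatMissingData": "notBreaching" parameter to your alarm configuration causes CloudWatch to treat missing data points as not breaching and transition the alarm back to OK: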

"AppErrorAlarm": {
        "Type": "AWS::CloudWatch::Alarm",
        "Properties": {
            "ActionsEnabled": "true",
            "AlarmName": {
                "Fn::Join": [
                    "",
                    [
                        {
                            "Ref": "AppId"
                        },
                        ",",
                        {
                            "Ref": "AppServerAG"
                        },
                        ":",
                        "AppError",
                        ",",
                        "MINOR"
                    ]
                ]
            },
            "AlarmDescription": {
                "Fn::Join": [
                    "",
                    [
                        "service is throwing error. Please check logs.",
                        {
                            "Ref": "AppServerAG"
                        },
                        "-",
                        {
                            "Ref": "AppId"
                        }
                    ]
                ]
            },
            "MetricName": "AppError",
            "Namespace": {
                "Fn::Join": [
                    "",
                    [
                        {
                            "Ref": "ApplicationEndpoint"
                        },
                        "metrics/AppError"
                    ]
                ]
            },
            "Statistic": "Sum",
            "Period": "300",
            "EvaluationPeriods": "1",
            "Threshold": "1",
            "TreatMissingData": "notBreaching",
            "AlarmActions": [{
              "Fn::GetAtt": [
                "VPCInfo",
                "SNSTopic"
              ]
            }],
            "ComparisonOperator": "GreaterThanOrEqualToThreshold"
        }
}
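
The first factor (the sparse metric) can also be approached from the metric filter side. The following is a minimal sketch, not part of the original answer, that reuses the AppErrorMetric filter from the question and adds the DefaultValue property of the metric transformation, so that the filter publishes 0 for ingested log events that do not match the pattern. It assumes the application logs something other than errors during normal operation. During periods when no log events arrive at all, nothing is published, so the TreatMissingData setting above is probably still needed.

```json
{
  "AppErrorMetric": {
    "Type": "AWS::Logs::MetricFilter",
    "Properties": {
      "LogGroupName": {
        "Ref": "AppServerLG"
      },
      "FilterPattern": "[error]",
      "MetricTransformations": [
        {
          "MetricValue": "1",
          "DefaultValue": 0,
          "MetricNamespace": {
            "Fn::Join": [
              "",
              [
                {
                  "Ref": "ApplicationEndpoint"
                },
                "/metrics/AppError"
              ]
            ]
          },
          "MetricName": "AppError"
        }
      ]
    }
  }
}
```

With zeros published alongside the ones, the Sum over a 300-second period falls below the threshold of 1 as soon as a period contains log events but no errors, which gives the alarm a data point to return to OK on.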