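I have a CloudWatch alarm with the following characteristics:
Description: exchange_failure >= 3 for 1 datapoints within 5 minutes
Period: 300
Statistic: Sum
Treat missing data as: missing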
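For reference, the same settings can be written out as the keyword arguments that boto3's `put_metric_alarm` accepts. This is only a sketch: the namespace and alarm name are hypothetical placeholders, since the alarm's real ones are not shown here.

```python
def alarm_kwargs(namespace: str, alarm_name: str) -> dict:
    """Build put_metric_alarm keyword arguments matching the settings listed above."""
    return {
        "AlarmName": alarm_name,          # hypothetical, supplied by the caller
        "Namespace": namespace,           # hypothetical, supplied by the caller
        "MetricName": "exchange_failure",
        "Statistic": "Sum",
        "Period": 300,                    # seconds, i.e. 5 minutes
        "EvaluationPeriods": 1,
        "DatapointsToAlarm": 1,           # "1 datapoint within 5 minutes"
        "Threshold": 3.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "missing",
    }


# Placeholder names, not the real alarm's.
params = alarm_kwargs("MyApp", "exchange-failure-alarm")

# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```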
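The exchange_failure metric is sparse, meaning that some periods will have data and others will have none.
During periods of high and low alarm activity, the alarm transitions rapidly between the OK and ALARM states, more often than I would expect from the graph. Specifically, from 00:45 to 00:00 there are 6 state transitions, where I would expect 0.
I looked at the state change history and found that the alarm changes its evaluation range from one minute to the next, which causes the rapid transitions.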
# 2019-05-12 00:58:00 alarm -> ok
# evaluation range - 15 minutes
{
  "newState": {
    "stateValue": "OK",
    "stateReason": "Threshold Crossed: 1 out of the last 1 datapoints [1.0 (12/05/19 00:43:00)] was not greater than or equal to the threshold (3.0) (minimum 1 datapoint for ALARM -> OK transition).",
    "stateReasonData": {
      "version": "1.0",
      "queryDate": "2019-05-12T00:58:10.050+0000",
      "startDate": "2019-05-12T00:43:00.000+0000",
      "statistic": "Sum",
      "period": 300,
      "recentDatapoints": [
        1
      ],
      "threshold": 3
    }
  }
}
# 2019-05-12 00:54:00 ok -> alarm
# evaluation range - 15 minutes
{
  "newState": {
    "stateValue": "ALARM",
    "stateReason": "Threshold Crossed: 1 out of the last 1 datapoints [6.0 (12/05/19 00:39:00)] was greater than or equal to the threshold (3.0) (minimum 1 datapoint for OK -> ALARM transition).",
    "stateReasonData": {
      "version": "1.0",
      "queryDate": "2019-05-12T00:54:10.027+0000",
      "startDate": "2019-05-12T00:39:00.000+0000",
      "statistic": "Sum",
      "period": 300,
      "recentDatapoints": [
        6
      ],
      "threshold": 3
    }
  }
}
# 2019-05-12 00:53:00 alarm -> ok
# evaluation range - 10 minutes
{
  "newState": {
    "stateValue": "OK",
    "stateReason": "Threshold Crossed: 1 out of the last 1 datapoints [1.0 (12/05/19 00:43:00)] was not greater than or equal to the threshold (3.0) (minimum 1 datapoint for ALARM -> OK transition).",
    "stateReasonData": {
      "version": "1.0",
      "queryDate": "2019-05-12T00:53:10.026+0000",
      "startDate": "2019-05-12T00:43:00.000+0000",
      "statistic": "Sum",
      "period": 300,
      "recentDatapoints": [
        1
      ],
      "threshold": 3
    }
  }
}
# 2019-05-12 00:49:00 ok -> alarm
# evaluation range - 10 minutes
{
  "newState": {
    "stateValue": "ALARM",
    "stateReason": "Threshold Crossed: 1 out of the last 1 datapoints [6.0 (12/05/19 00:39:00)] was greater than or equal to the threshold (3.0) (minimum 1 datapoint for OK -> ALARM transition).",
    "stateReasonData": {
      "version": "1.0",
      "queryDate": "2019-05-12T00:49:10.026+0000",
      "startDate": "2019-05-12T00:39:00.000+0000",
      "statistic": "Sum",
      "period": 300,
      "recentDatapoints": [
        6
      ],
      "threshold": 3
    }
  }
}
# 2019-05-12 00:48:00 alarm -> ok
# evaluation range - 5 minutes
{
  "newState": {
    "stateValue": "OK",
    "stateReason": "Threshold Crossed: 1 out of the last 1 datapoints [1.0 (12/05/19 00:43:00)] was not greater than or equal to the threshold (3.0) (minimum 1 datapoint for ALARM -> OK transition).",
    "stateReasonData": {
      "version": "1.0",
      "queryDate": "2019-05-12T00:48:10.027+0000",
      "startDate": "2019-05-12T00:43:00.000+0000",
      "statistic": "Sum",
      "period": 300,
      "recentDatapoints": [
        1
      ],
      "threshold": 3
    }
  }
}
# 2019-05-12 00:43:00 ok -> alarm
# evaluation range - 5 minutes
{
  "newState": {
    "stateValue": "ALARM",
    "stateReason": "Threshold Crossed: 1 out of the last 1 datapoints [5.0 (12/05/19 00:38:00)] was greater than or equal to the threshold (3.0) (minimum 1 datapoint for OK -> ALARM transition).",
    "stateReasonData": {
      "version": "1.0",
      "queryDate": "2019-05-12T00:43:10.042+0000",
      "startDate": "2019-05-12T00:38:00.000+0000",
      "statistic": "Sum",
      "period": 300,
      "recentDatapoints": [
        5
      ],
      "threshold": 3
    }
  }
}
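The "evaluation range" figures in the comments above are simply queryDate minus startDate for each entry. A minimal sketch of that arithmetic, using only the timestamps from the history (the helper name `evaluation_range_minutes` is my own, not part of any AWS API):

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S.%f%z"  # matches e.g. 2019-05-12T00:58:10.050+0000


def evaluation_range_minutes(query_date: str, start_date: str) -> int:
    """Whole minutes between startDate and queryDate of one history entry."""
    delta = datetime.strptime(query_date, FMT) - datetime.strptime(start_date, FMT)
    return int(delta.total_seconds() // 60)


# (queryDate, startDate) pairs copied from the state change history, newest first.
HISTORY = [
    ("2019-05-12T00:58:10.050+0000", "2019-05-12T00:43:00.000+0000"),
    ("2019-05-12T00:54:10.027+0000", "2019-05-12T00:39:00.000+0000"),
    ("2019-05-12T00:53:10.026+0000", "2019-05-12T00:43:00.000+0000"),
    ("2019-05-12T00:49:10.026+0000", "2019-05-12T00:39:00.000+0000"),
    ("2019-05-12T00:48:10.027+0000", "2019-05-12T00:43:00.000+0000"),
    ("2019-05-12T00:43:10.042+0000", "2019-05-12T00:38:00.000+0000"),
]

ranges = [evaluation_range_minutes(q, s) for q, s in HISTORY]
print(ranges)
```

The range alternates between reusing the 00:43 datapoint and the 00:39 datapoint while growing over time, which is the pattern I am asking about.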
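In https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html, the documentation states a couple of things: "If some data points in the evaluation range are missing, but the number of existing data points retrieved is equal to or more than the alarm's Evaluation Periods, CloudWatch evaluates the alarm state based on the most recent existing data points that were successfully retrieved. In this case, the value you set for how to treat missing data is not needed and is ignored."
"A particular case of this behavior is that CloudWatch alarms may repeatedly re-evaluate the last set of data points for a period of time after the metric has stopped flowing. This re-evaluation may cause the alarm to change state and re-execute actions, if it had changed state immediately prior to the metric stream stopping. To mitigate this behavior, use shorter periods."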
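To make the quoted behavior concrete, here is a toy model of one way to read it. It is a sketch only, not CloudWatch's actual implementation. It assumes (my guess, not something the documentation states) that at each minute the alarm looks at 5-minute periods aligned to that minute and uses the most recent one that contains data. The per-minute counts in `RAW` and the `MAX_LOOKBACK` limit are hypothetical, chosen only so that the 5-minute sums match the datapoints in the history (5 at 00:38, 6 at 00:39, 1 at 00:43).

```python
from datetime import datetime, timedelta

PERIOD = timedelta(minutes=5)
THRESHOLD = 3
MAX_LOOKBACK = 3  # hypothetical: how many aligned periods back the toy model searches


def t(hhmm: str) -> datetime:
    """Parse an HH:MM time on 2019-05-12."""
    return datetime.strptime("2019-05-12 " + hhmm, "%Y-%m-%d %H:%M")


# Hypothetical raw samples; minutes with no entry are missing (sparse metric).
RAW = {t("00:39"): 1, t("00:40"): 1, t("00:42"): 3, t("00:43"): 1}


def period_sum(start: datetime):
    """Sum of samples in [start, start + PERIOD), or None if the period is empty."""
    values = [v for ts, v in RAW.items() if start <= ts < start + PERIOD]
    return sum(values) if values else None


def evaluate(query_minute: datetime):
    """State from the most recent non-empty period aligned to query_minute."""
    for k in range(1, MAX_LOOKBACK + 1):
        start = query_minute - k * PERIOD
        value = period_sum(start)
        if value is not None:
            state = "ALARM" if value >= THRESHOLD else "OK"
            return state, start, value
    return None  # nothing found within the lookback window


def transitions(first: datetime, last: datetime, initial_state: str = "OK"):
    """Evaluate once per minute and record every state change."""
    state, changes = initial_state, []
    minute = first
    while minute <= last:
        result = evaluate(minute)
        if result is not None and result[0] != state:
            state = result[0]
            changes.append((minute.strftime("%H:%M"), state,
                            result[1].strftime("%H:%M"), result[2]))
        minute += timedelta(minutes=1)
    return changes


changes = transitions(t("00:40"), t("00:58"))
for when, state, datapoint_start, value in changes:
    print(f"{when} -> {state} (datapoint {value} starting {datapoint_start})")
```

Under these assumptions the sketch yields transitions at 00:43, 00:48, 00:49, 00:53, 00:54 and 00:58, the same minutes as the queryDate values in the history. That agreement does not confirm the mechanism; it only shows that alternating between differently aligned 5-minute windows is one way the flapping could arise.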
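Based on the documentation, re-evaluation is probably what is happening. What I do not understand is why the evaluation range varies so much, and how to avoid it. Are there any options? I would rather not use a shorter period, because I want to catch the case where there is 1 exchange_failure per minute for several consecutive minutes. A 1-minute period would miss that.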
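A small sketch of why a shorter period does not work for me, using hypothetical data (one failure per minute for five minutes) and the alarm's threshold of 3:

```python
THRESHOLD = 3

# Hypothetical: one exchange_failure in each of five consecutive minutes.
per_minute = [1, 1, 1, 1, 1]


def breaches_with_one_minute_period(samples, threshold=THRESHOLD):
    """True if any single 1-minute Sum datapoint reaches the threshold."""
    return any(value >= threshold for value in samples)


def breaches_with_five_minute_period(samples, threshold=THRESHOLD):
    """True if the Sum over one 5-minute period reaches the threshold."""
    return sum(samples[:5]) >= threshold


print(breaches_with_one_minute_period(per_minute))   # each 1-minute Sum is 1
print(breaches_with_five_minute_period(per_minute))  # the 5-minute Sum is 5
```

With a 1-minute period every datapoint stays at 1, below the threshold, while the 5-minute Sum of 5 crosses it.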
One option is to increase the period from 5 minutes to 15 minutes. In that case, I would expect the state to change less often.