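Goal: I want a notification 5 minutes after an alert state is entered, and then one every 30 minutes after that.
I tried using .count() and the time functions, but that went nowhere. I don't want to work through that mess, and I couldn't find a way to make it user-friendly and reliable.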
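For comparison, here is a single-pipeline way to express this schedule. It is a hypothetical sketch, not the script used in this question, and it assumes Kapacitor 1.3 or later (where `|stateDuration()` is available). `stateDuration` records how long the condition has held, the alert fires once that reaches 5 minutes, and, as I understand the AlertNode documentation, passing an interval to `.stateChangesOnly()` makes Kapacitor re-send an unchanged non-OK level at least that often. The `time_steal` field, the threshold `85`, and the `$EMAIL` address are placeholders taken from the full script in this question.

```tickscript
// Hypothetical sketch: field name, threshold and address are placeholders.
stream
    |from()
        .database('telegraf')
        .measurement('cpu')
        .groupBy('host')
        .where(lambda: "cpu" == 'cpu-total')
    // Minutes the condition has been continuously true; -1 while it is false.
    |stateDuration(lambda: "time_steal" > 85)
        .unit(1m)
    |alert()
        .id('{{ index .Tags "host" }}')
        .message('{{ .ID }} is {{ .Level }}: CPU STEAL USAGE')
        // Only raise WARNING after the condition has held for 5 minutes.
        .warn(lambda: "state_duration" >= 5)
        // Send on level changes, and re-send a non-OK level every 30 minutes.
        .stateChangesOnly(30m)
        .email('$EMAIL')
```

Because only one alert node is involved, a recovery produces a single OK event instead of one per node.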
The solution I'm currently using is two streams with separate windows.
var initialData = stream
    |from()
        .database(db)
        .measurement(metricType)
        .retentionPolicy(rPolicy)
        .groupBy(group)
        .where(lambda: "cpu" == 'cpu-total')
        .where(lambda: "host" =~ hostFilter)
    // Window the raw points first, then reduce each window with mean().
    |window()
        .period(10m)
        .every(5m)
        .align()
    |mean(metric)
        .as('initialStat')

var continuousData = stream
    |from()
        .database(db)
        .measurement(metricType)
        .retentionPolicy(rPolicy)
        .groupBy(group)
        // Filter on the "cpu" tag; metricType == 'cpu-total' is always false.
        .where(lambda: "cpu" == 'cpu-total')
        .where(lambda: "host" =~ hostFilter)
    |window()
        .period(10m)
        .every(30m)
        .align()
    |mean(metric)
        .as('continuousStat')
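Besides the fact that this seems odd (I have to compute every value twice), I also need separate |alert() nodes. The first node only notifies on state changes, but the second one can't do that, because it is what sends me a reminder every N minutes. Another problem is that the first |alert() node sends an OK notification, and the second node also sends an OK N minutes later.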
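I feel there must be a better way to do this. I think I could use an if statement in the second |alert() node so that it doesn't send a notification on OK, since the first |window would handle that. I haven't figured out how to do that yet, but I'm sure it's possible. I also don't want to bash TICKscript; I know it isn't meant to be a full language (Issue 741).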
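Kapacitor's AlertNode has a `.noRecoveries()` property that stops a node from sending the OK (recovery) event, which is one way to get this behavior without an `if`. Below is a minimal sketch of the second node, assuming the `continuousCalculation`, `warn`, `crit`, `warnSig`, `critSig` and `emailAddress` variables and the `continuousStat`/`continuousSigma` fields from the full script below. The comparisons use `>` on the assumption that the alert should fire when steal usage is high.

```tickscript
// Sketch: periodic reminder node that never sends OK.
// The first node (with .stateChangesOnly()) is left to send the single OK.
var continuousCondition = continuousCalculation
    |alert()
        .id('{{ index .Tags "host" }}')
        .message('{{ .ID }} is {{ .Level }}: CPU STEAL USAGE {{ index .Fields "continuousStat" }}% LONG')
        .warn(lambda: "continuousStat" > warn OR "continuousSigma" > warnSig)
        .crit(lambda: "continuousStat" > crit OR "continuousSigma" > critSig)
        // Suppress the event emitted when the level returns to OK.
        .noRecoveries()
        .email(emailAddress)
```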
The full TICKscript is below.

// CONFIGURATION PARAMETERS
// Alerting
var emailAddress = '$EMAIL'
var pagerdutyKey = '$PD'
var slackChannel = '$SLACK'
// Static Thresholds in percent cpu steal used
var warn = 85
var crit = 95
// Dynamic thresholds in number of std deviations
var warnSig = 2.5
var critSig = 3.5
// Print INFO level (every result will be an alert)
// AlertNode.StateChangesOnly will also need to be disabled
// NOTE:
// INFO level alerts will be disregarded by the pagerduty handler, this is not configurable.
var debug = FALSE
// Datastream
// Define the data that will be acted upon
var db = 'telegraf'
var group = 'host'
var metricType = 'cpu'
var metric = 'time_steal'
var rPolicy = 'default'
// Regex used to filter on a subset of hosts
var hostFilter = /.+/
// Window
var dataPeriod = 10m
var initialFrequency = 5m
var continuousFrequency = 30m
// DATAFRAME
var initialData = stream
    |from()
        .database(db)
        .measurement(metricType)
        .retentionPolicy(rPolicy)
        .groupBy(group)
        // Filter on the "cpu" tag; metricType == 'cpu-total' is always false.
        .where(lambda: "cpu" == 'cpu-total')
        .where(lambda: "host" =~ hostFilter)
    // Window the raw points first, then reduce each window with mean().
    |window()
        .period(dataPeriod)
        .every(initialFrequency)
        .align()
    |mean(metric)
        .as('initialStat')

var continuousData = stream
    |from()
        .database(db)
        .measurement(metricType)
        .retentionPolicy(rPolicy)
        .groupBy(group)
        .where(lambda: "cpu" == 'cpu-total')
        .where(lambda: "host" =~ hostFilter)
    |window()
        .period(dataPeriod)
        .every(continuousFrequency)
        .align()
    |mean(metric)
        .as('continuousStat')
// Calculations
var initialCalculation = initialData
    |eval(lambda: sigma("initialStat"))
        .as('initialSigma')
        .keep()

var continuousCalculation = continuousData
    |eval(lambda: sigma("continuousStat"))
        .as('continuousSigma')
        .keep()

// ALERT CONDITIONS
// Fields are referenced by the names set with .as() above; the alert fires
// when steal usage or its deviation exceeds the thresholds.
var initialCondition = initialCalculation
    |alert()
        .id('{{ index .Tags "host" }}')
        .message('{{ .ID }} is {{ .Level }}: CPU STEAL USAGE {{ index .Fields "initialStat" }}% SHORT')
        .details('this is an alert')
        .stateChangesOnly()
        .info(lambda: debug)
        .warn(lambda: "initialStat" > warn OR "initialSigma" > warnSig)
        .crit(lambda: "initialStat" > crit OR "initialSigma" > critSig)

var continuousCondition = continuousCalculation
    |alert()
        .id('{{ index .Tags "host" }}')
        .message('{{ .ID }} is {{ .Level }}: CPU STEAL USAGE {{ index .Fields "continuousStat" }}% LONG')
        .details('this is an alert')
        .info(lambda: debug)
        .warn(lambda: "continuousStat" > warn OR "continuousSigma" > warnSig)
        .crit(lambda: "continuousStat" > crit OR "continuousSigma" > critSig)
// ACTIONS
continuousCondition
    // .log('/tmp/alerts/cpu_steal_usage_alerts')
    // .slack()
    //     .channel(slackChannel)
    .email(emailAddress)
    .pagerDuty()
        .serviceKey(pagerdutyKey)

initialCondition
    // .log('/tmp/alerts/cpu_steal_usage_alerts')
    // .slack()
    //     .channel(slackChannel)
    .email(emailAddress)
    .pagerDuty()
        .serviceKey(pagerdutyKey)
Answer 0 (score: 1)
Apparently, I can use multiple windows in a single stream node.
stream
    |from()
        .database(db)
        .measurement(metricType)
        .retentionPolicy(rPolicy)
        .groupBy(group)
        // Filter on the "cpu" tag (metricFilter is not defined in the question's script).
        .where(lambda: "cpu" == 'cpu-total')
        .where(lambda: "host" =~ hostFilter)
    |window()
        .period(dataPeriod)
        .every(initialFrequency)
        .align()
    |mean(metric)
        .as('initialStat')
    |window()
        .period(dataPeriod)
        .every(continuousFrequency)
        .align()
    |mean(metric)
        .as('continuousStat')
I'm still working on the OK problem, though.
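Note that the pipeline in this answer chains the two windows in series: the second `|window()` collects the points emitted by the first `|mean()`, so `continuousStat` is a mean of 5-minute means rather than a mean of the raw points. If two independent windows over the same raw data are wanted, TICKscript lets one node feed several children through a variable. A sketch, assuming the variables from the question's script (`db`, `metricType`, `rPolicy`, `group`, `hostFilter`, `metric`, `dataPeriod`, `initialFrequency`, `continuousFrequency`); `source` is a hypothetical variable name:

```tickscript
// Sketch: one shared source node branching into two independent windows.
var source = stream
    |from()
        .database(db)
        .measurement(metricType)
        .retentionPolicy(rPolicy)
        .groupBy(group)
        .where(lambda: "cpu" == 'cpu-total')
        .where(lambda: "host" =~ hostFilter)

// Branch 1: mean over the raw points, emitted every initialFrequency.
var initialData = source
    |window()
        .period(dataPeriod)
        .every(initialFrequency)
        .align()
    |mean(metric)
        .as('initialStat')

// Branch 2: mean over the same raw points, emitted every continuousFrequency.
var continuousData = source
    |window()
        .period(dataPeriod)
        .every(continuousFrequency)
        .align()
    |mean(metric)
        .as('continuousStat')
```

This keeps a single `from()` definition while leaving the two windows (and the alert nodes built on them) independent.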