Question

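I monitor several containers using Prometheus, cAdvisor, and Prometheus Alertmanager. What I want is to receive an alert when a container goes down for some reason. The problem is that when a container dies, cAdvisor no longer collects any metrics for it, so any query returns "no data" because nothing matches.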

Answer 1

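Take a look at the Prometheus function absent().

absent(v instant-vector) returns an empty vector if the vector passed to it has any elements, and a 1-element vector with the value 1 if the vector passed to it has no elements.

This is useful for alerting when no time series exist for a given metric name and label combination.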

Examples:

absent(nonexistent{job="myjob"})
# => {job="myjob"}

absent(nonexistent{job="myjob",instance=~".*"})
# => {job="myjob"}

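Here is an example expression for an alert. Aggregating with sum() drops the labels, so the result carries none:

absent(sum(nonexistent{job="myjob"}))
# => {}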
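To turn an absent() expression into an actual alert, it has to be wrapped in an alerting rule that Prometheus loads from a rule file. A minimal sketch follows, assuming a cAdvisor series that carries a name="my-container" label. The file name, group name, alert name, severity label, 1m hold time, and annotation text are illustrative choices, not part of the answer.

```yaml
# rules/container_absent.yml (hypothetical file name)
groups:
  - name: container-alerts          # illustrative group name
    rules:
      - alert: ContainerAbsent      # illustrative alert name
        # absent() yields a 1-element vector when no matching series exists
        expr: absent(container_memory_usage_bytes{name="my-container"})
        for: 1m                     # assumed hold time to ride out a single missed scrape
        labels:
          severity: critical
        annotations:
          summary: "No cAdvisor metrics for container my-container"
```

The file needs to be listed under rule_files in prometheus.yml, and it can be validated beforehand with promtool check rules.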

Answer 2

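I use a small tool called Docker Event Monitor. It runs as a container on the Docker host and sends alerts to Slack, Discord, or SparkPost when certain events are triggered. You can configure which events trigger alerts.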

Answer 3

Try this:

 time() - container_last_seen{label="whatever-label-you-have", job="myjob"} > 60

This fires an alert if the container has not been seen for 60 seconds. Or:

absent(container_memory_usage_bytes{label="whatever-label-you-have", job="myjob"})

Note that with the second approach, it may take a while for the container's memory usage to reach 0.
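The first expression can be placed in an alerting rule more or less unchanged. The sketch below assumes containers are identified by cAdvisor's name label and that one rule should cover every named container. The name=~".+" matcher, the alert name, and the annotation are made up for illustration.

```yaml
groups:
  - name: container-last-seen       # illustrative group name
    rules:
      - alert: ContainerNotSeen     # illustrative alert name
        # Seconds since cAdvisor last saw each container;
        # name=~".+" (an assumption) skips series without a container name.
        expr: time() - container_last_seen{name=~".+"} > 60
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} not seen for {{ $value | humanizeDuration }}"
```

Because this expression keeps one series per container, a single rule can cover all of them, whereas absent() can only report labels taken from its selector and so needs one rule per container. The trade-off is that the alert can only fire while the series is still being returned, so it may resolve by itself once the series drops out of query results.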

Answer 4

We can use either of these two:

absent(container_start_time_seconds{name="my-container"})

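This specific metric, which contains a timestamp, does not seem to linger for 5 minutes before going stale, so it disappears from Prometheus results as soon as it is missing from the last scrape (see: https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness), rather than after 5 minutes like container_cpu_usage_seconds_total. It tested OK, but I am not sure I fully understand staleness...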

Otherwise you can use this one:

time() - timestamp(container_cpu_usage_seconds_total{name="mycontainer"}) > 60 OR absent(container_cpu_usage_seconds_total{name="mycontainer"})

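The first part gives the time since the metric was last scraped, so it works when the metric has disappeared from the exporter output but is still returned by PromQL (for 5 minutes by default). You have to adjust the > 60 threshold to match your scrape interval.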
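Written as an alerting rule, the combined expression might look like the following. This is a sketch that assumes a 15-second scrape interval, so that 60 seconds corresponds to roughly four missed scrapes. The interval, group name, alert name, and labels are placeholders to adjust.

```yaml
groups:
  - name: container-staleness          # illustrative group name
    rules:
      - alert: ContainerMetricsStale   # illustrative alert name
        # First operand: a sample is still returned but is older than 60s
        # (about 4 scrapes at an assumed 15s interval).
        # Second operand: the series is no longer returned at all.
        expr: |
          time() - timestamp(container_cpu_usage_seconds_total{name="mycontainer"}) > 60
            or
          absent(container_cpu_usage_seconds_total{name="mycontainer"})
        labels:
          severity: critical
        annotations:
          summary: "Container mycontainer stopped reporting CPU metrics"
```

Keeping the threshold at a few multiples of the scrape interval avoids firing on a single delayed scrape while still reacting well before the 5-minute window expires.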

Alert when a Docker container stops
