In the Kubernetes-client Java API, I can get the total number of available and deployed pod instances for a given application like this:
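// Imports as laid out in older kubernetes-client/java releases that still ship AppsV1beta1Api.
import io.kubernetes.client.ApiClient;
import io.kubernetes.client.ApiException;
import io.kubernetes.client.Configuration;
import io.kubernetes.client.apis.AppsV1beta1Api;
import io.kubernetes.client.models.AppsV1beta1Deployment;
import io.kubernetes.client.models.AppsV1beta1DeploymentList;
import io.kubernetes.client.models.AppsV1beta1DeploymentStatus;
import java.util.Map;

ApiClient defaultClient = Configuration.getDefaultApiClient();
AppsV1beta1Api apiInstance = new AppsV1beta1Api(defaultClient);
...
try {
    AppsV1beta1DeploymentList result = apiInstance.listDeploymentForAllNamespaces(_continue, fieldSelector, includeUninitialized, labelSelector, limit, pretty, resourceVersion, timeoutSeconds, watch);
    // The list returned by AppsV1beta1Api holds AppsV1beta1Deployment items.
    for (AppsV1beta1Deployment deployment : result.getItems()) {
        Map<String, String> labels = deployment.getMetadata().getLabels();
        String appName = (labels == null) ? "" : labels.getOrDefault("app", "");
        AppsV1beta1DeploymentStatus status = deployment.getStatus();
        // Replica counts are Integer and are null when zero (the API server omits them),
        // so unboxing them straight into int could throw a NullPointerException.
        int availablePods = status.getAvailableReplicas() == null ? 0 : status.getAvailableReplicas();
        int deployedPods = status.getReplicas() == null ? 0 : status.getReplicas();
        if (availablePods != deployedPods) {
            // Generate an alert for appName
        }
    }
} catch (ApiException e) {
    System.err.println("Exception when calling AppsV1beta1Api#listDeploymentForAllNamespaces");
    e.printStackTrace();
}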
In the example above, I compare availablePods and deployedPods, and if they do not match, an alert is generated.
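How can I replicate this logic in Prometheus with alerting rules and/or Alertmanager configuration, so that it checks the number of available pod instances for a given application or job and fires an alert if that number does not match the specified number of instances?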
The specified threshold could be the total deployedPods, or it could come from another configuration file or template.
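Before moving the check into Prometheus, it helps to pin down exactly what is compared. Below is a minimal Java sketch of that logic, covering both cases from the question: the expected count is deployedPods by default, or a value from separate configuration. The ReplicaCheck class and the per-application overrides map (standing in for "another configuration file or template") are hypothetical. A missing count is treated as 0, because the API server omits availableReplicas when it is zero.

```java
import java.util.Map;

public class ReplicaCheck {

    // Treats a missing (null) replica count as 0; the API server omits
    // availableReplicas from the status when no pod is available.
    static int orZero(Integer value) {
        return value == null ? 0 : value;
    }

    // Expected instance count: a per-app value from a hypothetical
    // configuration map if present, otherwise the deployment's own replicas.
    static int expectedReplicas(String appName, Integer deployedPods, Map<String, Integer> overrides) {
        Integer configured = overrides.get(appName);
        return configured != null ? configured : orZero(deployedPods);
    }

    // True when the number of available pods differs from the expected count.
    static boolean shouldAlert(String appName, Integer availablePods, Integer deployedPods,
                               Map<String, Integer> overrides) {
        return orZero(availablePods) != expectedReplicas(appName, deployedPods, overrides);
    }

    public static void main(String[] args) {
        Map<String, Integer> overrides = Map.of("billing", 5); // hypothetical config
        System.out.println(shouldAlert("web", 3, 3, overrides));     // false
        System.out.println(shouldAlert("web", null, 2, overrides));  // true
        System.out.println(shouldAlert("billing", 3, 3, overrides)); // true
    }
}
```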
Answer:
I don't know how to do this for all namespaces, but for a single namespace it looks like this:
curl -k -s -G 'https://prometheus-k8s/api/v1/query' --data-urlencode 'query=(sum(kube_deployment_spec_replicas{namespace="default"}) without (deployment, instance, pod)) - (sum(kube_deployment_status_replicas_available{namespace="default"}) without (deployment, instance, pod))'
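This sends an instant query to the Prometheus HTTP API for the default namespace. The result is the number of desired replicas minus the number of available replicas, summed over all deployments in that namespace.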
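The request passes the PromQL expression as the URL-encoded query parameter of the /api/v1/query endpoint. To run the same check for another namespace from code, you can build that URL yourself. The following is a minimal Java sketch; the PromQuery class and its method names are hypothetical, and https://prometheus-k8s is simply the host used in the answer. URLEncoder uses form encoding, so spaces become '+'. Prometheus reads the parameter as a form value and decodes '+' back to a space.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PromQuery {

    // PromQL from the answer for one namespace: desired replicas minus
    // available replicas, with the deployment/instance/pod labels summed away.
    static String mismatchQuery(String namespace) {
        String sel = "{namespace=\"" + namespace + "\"}";
        return "(sum(kube_deployment_spec_replicas" + sel + ") without (deployment, instance, pod))"
             + " - (sum(kube_deployment_status_replicas_available" + sel + ") without (deployment, instance, pod))";
    }

    // Appends the encoded expression as the "query" parameter of the instant-query endpoint.
    static String queryUrl(String baseUrl, String namespace) {
        return baseUrl + "/api/v1/query?query="
             + URLEncoder.encode(mismatchQuery(namespace), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(queryUrl("https://prometheus-k8s", "default"));
    }
}
```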
The alerting rule looks like this:
groups:
- name: example
  rules:
  # Fire when desired and available replicas in the namespace differ for 15 minutes.
  - alert: AvailablePodsNotEqualDeployedPods
    expr: (sum(kube_deployment_spec_replicas{namespace="$Name_of_namespace"}) without (deployment, instance, pod)) - (sum(kube_deployment_status_replicas_available{namespace="$Name_of_namespace"}) without (deployment, instance, pod)) != 0
    for: 15m
    labels:
      severity: page
    annotations:
      summary: "availablePods is not equal to deployedPods"
      description: "In namespace $Name_of_namespace, availablePods has not equalled deployedPods for more than 15 minutes."
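The for: 15m line is what delays the alert. On each evaluation where expr returns a result, the alert is pending. It only fires once that condition has held continuously for 15 minutes, and if the expression stops returning a result in between, the timer starts over. Below is a simplified Java model of that state machine, not Prometheus source code. The class, enum, and method names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// Simplified model (not Prometheus code) of an alerting rule with a "for" clause.
public class ForClauseModel {

    enum State { INACTIVE, PENDING, FIRING }

    private final Duration holdFor;
    private Instant activeSince; // null while the expression returns no result

    ForClauseModel(Duration holdFor) {
        this.holdFor = holdFor;
    }

    // Called once per rule evaluation; exprMatched means expr returned a result,
    // i.e. the replica difference was != 0 at this evaluation.
    State evaluate(Instant now, boolean exprMatched) {
        if (!exprMatched) {
            activeSince = null; // condition cleared: the timer restarts next time
            return State.INACTIVE;
        }
        if (activeSince == null) {
            activeSince = now; // first matching evaluation: the alert becomes pending
        }
        boolean heldLongEnough = Duration.between(activeSince, now).compareTo(holdFor) >= 0;
        return heldLongEnough ? State.FIRING : State.PENDING;
    }

    public static void main(String[] args) {
        ForClauseModel rule = new ForClauseModel(Duration.ofMinutes(15));
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");
        System.out.println(rule.evaluate(t0, true));                        // PENDING
        System.out.println(rule.evaluate(t0.plusSeconds(14 * 60), true));   // PENDING
        System.out.println(rule.evaluate(t0.plusSeconds(15 * 60), true));   // FIRING
        System.out.println(rule.evaluate(t0.plusSeconds(16 * 60), false));  // INACTIVE
    }
}
```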
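Don't forget to replace $Name_of_namespace with the name of the namespace you want to check. It is only a placeholder: Prometheus does not substitute variables in rule files.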
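Prometheus does not expand $Name_of_namespace, so one way to cover several namespaces is to generate one copy of the rule per namespace before loading the file. This is a minimal Java sketch under these assumptions: the RuleRenderer class and the namespace list are hypothetical, and the alert name AvailablePodsNotEqualDeployedPods is illustrative. Validate the generated file (for example with promtool check rules) before using it.

```java
import java.util.List;

public class RuleRenderer {

    // One rule entry, indented to sit under "rules:". $Name_of_namespace is plain
    // text that Prometheus would not expand, so it is replaced here instead.
    private static final String RULE_TEMPLATE = String.join("\n",
        "  - alert: AvailablePodsNotEqualDeployedPods",
        "    expr: (sum(kube_deployment_spec_replicas{namespace=\"$Name_of_namespace\"}) without (deployment, instance, pod))"
            + " - (sum(kube_deployment_status_replicas_available{namespace=\"$Name_of_namespace\"}) without (deployment, instance, pod)) != 0",
        "    for: 15m",
        "    labels:",
        "      severity: page",
        "    annotations:",
        "      summary: \"availablePods is not equal to deployedPods\"",
        "      description: \"In namespace $Name_of_namespace, availablePods has not equalled deployedPods for more than 15 minutes.\"");

    // Renders a complete rule file containing one copy of the rule per namespace.
    static String render(List<String> namespaces) {
        StringBuilder out = new StringBuilder("groups:\n- name: example\n  rules:\n");
        for (String ns : namespaces) {
            // String.replace is a literal (non-regex) replacement, so '$' needs no escaping.
            out.append(RULE_TEMPLATE.replace("$Name_of_namespace", ns)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(render(List.of("default", "kube-system")));
    }
}
```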