We run Prometheus on k8s, but it fails to start due to insufficient RAM (CPU is also close to the limit). Since this is all new to me, I'm not sure which approach to take. I tried deploying the container with a slightly increased RAM limit (the node has 16Gi; I raised the limit from 145xxMi to 15Gi). The pod stays in Pending status.
Normal NotTriggerScaleUp 81s (x16 over 5m2s) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 3 node(s) didn't match node selector, 2 Insufficient memory
Warning FailedScheduling 80s (x6 over 5m23s) default-scheduler 0/10 nodes are available: 10 Insufficient memory, 6 node(s) didn't match node selector, 9 Insufficient cpu.
Normal NotTriggerScaleUp 10s (x14 over 5m12s) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 Insufficient memory, 3 node(s) didn't match node selector
These are the logs from after Prometheus crashed and would no longer start. Describing the pod also shows memory usage at 99%:
level=info ts=2020-10-09T09:39:34.745Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=53476 maxSegment=53650
level=info ts=2020-10-09T09:39:38.518Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=53477 maxSegment=53650
level=info ts=2020-10-09T09:39:41.244Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=53478 maxSegment=53650
What can I do to fix this? Note that there is no autoscaling in place.
Should I manually scale up the EC2 worker nodes? Is there something else I should be doing?
Answer 0 (score: 1)
The message from the cluster autoscaler reveals the problem:
cluster-autoscaler pod didn't trigger scale-up
Even if the cluster autoscaler added a new node to the cluster, Prometheus still would not fit on that node.
This is likely because EKS nodes reserve some of their 16Gi capacity for the system. The allocatable capacity appears to be less than 15Gi, which is why Prometheus no longer fits on the node after its memory request was increased.
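The arithmetic behind this can be sketched as follows. The reservation figures below are illustrative assumptions, not actual EKS values, but they show why a 15Gi request can fail on a 16Gi node while the original ~14.5Gi request fit:

```python
# Illustrative scheduling arithmetic (reservation numbers are assumptions,
# not real EKS values): why raising the request from ~14.5Gi to 15Gi can
# push a pod past a 16Gi node's allocatable memory. All values in MiB.

node_capacity_mi = 16 * 1024        # 16Gi instance
kube_reserved_mi = 900              # assumed: kubelet / container runtime
system_reserved_mi = 200            # assumed: OS daemons
eviction_threshold_mi = 100         # assumed: hard-eviction headroom

allocatable_mi = (node_capacity_mi - kube_reserved_mi
                  - system_reserved_mi - eviction_threshold_mi)

old_request_mi = 14500              # roughly the original 145xxMi request
new_request_mi = 15 * 1024          # the increased 15Gi request

print(allocatable_mi)                    # under these assumptions: 15184
print(old_request_mi <= allocatable_mi)  # True  -> used to schedule
print(new_request_mi <= allocatable_mi)  # False -> "Insufficient memory"
```

The exact `Allocatable` figure for a real node can be seen with `kubectl describe node <node-name>`.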
To resolve this, you can either reduce the memory request on the Prometheus pod, or add new, larger nodes with more available memory.
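As a sketch, lowering the request back below the node's allocatable memory would look like this in the Prometheus container spec (the values here are illustrative; adjust them to your actual manifest and workload):

```yaml
# Illustrative resources fragment for the Prometheus container;
# exact values depend on your manifest and actual memory needs.
resources:
  requests:
    memory: "12Gi"   # comfortably below the node's allocatable memory
    cpu: "1"
  limits:
    memory: "14Gi"   # still below allocatable on a 16Gi node
```

Keep in mind that if Prometheus genuinely needs close to 15Gi to replay its WAL, reducing the request will only move the failure from scheduling to an OOM kill, and larger nodes (or reducing retention/cardinality) are the real fix.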