Deployment fails due to insufficient memory

Date: 2020-10-10 14:15:07

Tags: kubernetes amazon-eks

We run Prometheus on k8s, but it fails to start because its RAM requirements cannot be met (CPU is also close to the limit). Since this is all new to me, I'm not sure which approach to take. I tried deploying the container with a slightly increased RAM limit (the nodes have 16Gi; I raised the value from 145xxMi to 15Gi). The pod stays in Pending:

  Normal   NotTriggerScaleUp  81s (x16 over 5m2s)   cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 3 node(s) didn't match node selector, 2 Insufficient memory
  Warning  FailedScheduling   80s (x6 over 5m23s)   default-scheduler   0/10 nodes are available: 10 Insufficient memory, 6 node(s) didn't match node selector, 9 Insufficient cpu.
  Normal   NotTriggerScaleUp  10s (x14 over 5m12s)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 Insufficient memory, 3 node(s) didn't match node selector

These are the logs from Prometheus after it crashed and stopped coming back up. Describing the pod also shows memory usage at 99%:

level=info ts=2020-10-09T09:39:34.745Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=53476 maxSegment=53650
level=info ts=2020-10-09T09:39:38.518Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=53477 maxSegment=53650
level=info ts=2020-10-09T09:39:41.244Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=53478 maxSegment=53650

What can I do to fix this? Please note that there is no autoscaling in place.

Do I scale the EC2 worker nodes manually? Is there anything else I should be doing?

1 Answer:

Answer 0 (score: 1)

The message emitted by the cluster autoscaler reveals the problem:

cluster-autoscaler pod didn't trigger scale-up

Even if the cluster autoscaler added a new node to the cluster, Prometheus still would not fit on it.

This is likely because EKS nodes have part of their 16Gi capacity reserved for the system. The allocatable capacity is evidently less than 15Gi, since Prometheus no longer fits on a node after its memory request was increased.
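A rough back-of-the-envelope check illustrates why a 15Gi request cannot fit on a 16Gi node. The reserve figures below are illustrative assumptions, not exact EKS values; the real numbers depend on the instance type and kubelet configuration:

```python
# Sketch: node allocatable memory vs. a 15Gi pod request.
# The kube_reserved and eviction_threshold values are assumptions for illustration.
GIB = 1024 ** 3

node_capacity = 16 * GIB               # total RAM on the instance
kube_reserved = 1 * GIB                # assumed reserve for kubelet/system daemons
eviction_threshold = int(0.1 * GIB)    # assumed hard eviction threshold

allocatable = node_capacity - kube_reserved - eviction_threshold
request = 15 * GIB

# Even on a completely empty node, the pod does not fit.
print(allocatable < request)  # → True
```

This is the same arithmetic the scheduler performs: a pod's request is compared against the node's allocatable memory, not its total capacity.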

To fix this, you can either reduce the memory request on the Prometheus pod, or add new, larger nodes with more available memory.
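As a sketch, lowering the request in the Prometheus container spec might look like the fragment below. The 14Gi figure is an assumption, not a recommendation; pick a value below the node's actual allocatable memory, which `kubectl describe node` reports under `Allocatable`:

```yaml
# Hypothetical fragment of the Prometheus container spec.
resources:
  requests:
    memory: "14Gi"   # must be below the node's Allocatable, not its total 16Gi
  limits:
    memory: "14Gi"
```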