Question

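Recently we have run into problems in both our non-production and production clusters, where nodes hit the "System OOM encountered" issue.

The nodes in the non-production cluster do not seem to be sharing the pods. It looks like a single node is running all the pods and putting load on the system.

Also, the pods are stuck in this state: 'Waiting: ContainerCreating'

Any help/guidance on the above issues would be greatly appreciated. We are building more and more services in this cluster and want to make sure there are no instability and/or environment issues, and that proper checks/configurations are in place before we go live.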

Answer 1

"I would recommend you manage container compute resources properly within your Kubernetes cluster. When creating a Pod, you can optionally specify how much CPU and memory (RAM) each Container needs to avoid OOM situations.

When Containers have resource requests specified, the scheduler can make better decisions about which nodes to place Pods on. And when Containers have their limits specified, contention for resources on a node can be handled in a specified manner. CPU specifications are in units of cores, and memory is specified in units of bytes.
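As a minimal sketch of the requests and limits described above, the snippet below writes a Pod manifest to a file. The Pod name `example-app`, the `nginx` image, the file name, and all of the numbers are assumptions chosen for illustration, not values taken from the question.

```shell
# Write a hypothetical Pod manifest; every name and value here is a placeholder.
cat > pod-with-resources.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 250m        # a quarter of a core; the scheduler uses this for placement
        memory: 64Mi
      limits:
        cpu: 500m        # CPU use above this is throttled
        memory: 128Mi    # a container exceeding this can be OOM-killed
EOF

# Against a real cluster, the manifest would then be applied with:
# kubectl apply -f pod-with-resources.yaml
```

With requests set on every container, the scheduler has the information it needs to spread pods across nodes instead of packing them onto one.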

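An event is produced each time the scheduler fails to find a place for a Pod. Use the command below to see those events (a bare grep for Events prints only the heading, so -A 10 includes the lines that follow it):

$ kubectl describe pod <pod-name> | grep -A 10 Events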

Also, read the official Kubernetes guide on “Configure Out Of Resource Handling”. Always make sure to:

- reserve 10-20% of memory capacity for system daemons like the kubelet and the OS kernel
- identify pods which can be evicted at 90-95% memory utilization, to reduce thrashing and the incidence of system OOM

To facilitate this kind of scenario, the kubelet would be launched with options like below:

--eviction-hard=memory.available<xMi
--system-reserved=memory=yGi

Replace x and y with actual memory values.
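To make the arithmetic concrete, here is a rough sketch of how x and y could be derived. It assumes a hypothetical node with 10 GiB of memory, a 10% reservation, and eviction at 95% utilization. It also assumes the reservation should cover the eviction threshold, and it prints both values in Mi to keep the numbers whole. The variable names are invented for this example.

```shell
# Assumed example node: 10 GiB of memory (a hypothetical figure).
capacity_mi=10240
reserve_pct=10      # memory kept back for system daemons (10-20% suggested above)
evict_at_pct=95     # start evicting pods at this memory utilization

# Evict once available memory drops below the remaining (100 - 95)% of capacity.
eviction_mi=$(( capacity_mi * (100 - evict_at_pct) / 100 ))

# Assumption: the system reservation also covers the eviction threshold.
reserved_mi=$(( capacity_mi * reserve_pct / 100 + eviction_mi ))

kubelet_flags="--eviction-hard=memory.available<${eviction_mi}Mi --system-reserved=memory=${reserved_mi}Mi"
echo "$kubelet_flags"
```

For this example node the script prints `--eviction-hard=memory.available<512Mi --system-reserved=memory=1536Mi`. Adjust the three inputs to match the real node size and the percentages you settle on.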

Having Heapster container monitoring in place should be helpful for visualization".


Answer 2

Unable to mount volumes for pod "xxx-3615518044-6l1cf_xxx-qa(8a5d9893-230b-11e8-a943-000d3a35d8f4)": timeout expired waiting for volumes to attach/mount for pod "xxx-service-3615518044-6l1cf"/"xxx-qa"

That indicates your pod is having trouble mounting the volume specified in your configuration. This can often be a permissions issue. If you post your config files (for example, as a gist) with private info removed, we could probably be more helpful.

Kubernetes cluster seems unstable
