We are running a Kubernetes cluster with a cluster autoscaler, and as far as I can tell it works correctly most of the time. When we raise the replica count of a given deployment beyond what the cluster's current resources can accommodate, the autoscaler notices and scales out; likewise, it scales back in when we need fewer resources.
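For context, the case that does reliably trigger scale-up for us looks roughly like the sketch below (the names, image, and sizes are illustrative, not our actual manifests):

# Illustrative only: a Deployment whose pods carry explicit memory requests.
# Scaling it past the cluster's free capacity, e.g.
#   kubectl scale deployment example-worker --replicas=20
# leaves the extra pods Pending with an Unschedulable condition,
# which the cluster autoscaler then reacts to by adding nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-worker
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-worker
  template:
    metadata:
      labels:
        app: example-worker
    spec:
      containers:
      - name: worker
        image: example/worker:latest
        resources:
          requests:
            memory: 4Gi
          limits:
            memory: 4Gi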
That was true until today, when some pods in our Airflow deployment stopped working because they could not get the resources they need. Instead of triggering a cluster scale-up, the pods either fail immediately or get evicted for trying to request or use more resources than are available. See the YAML output of one of the failing pods below. The pods also never show up as Pending: they jump straight from launch to a failed state.
Is there something I might be missing, in terms of some kind of retry tolerance, that would leave the pod in a Pending state so that it waits for a scale-up instead?
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: eks.privileged
  creationTimestamp: "2019-12-02T22:41:19Z"
  name: ingest-customer-ff06ae4d
  namespace: airflow
  resourceVersion: "32545690"
  selfLink: /api/v1/namespaces/airflow/pods/ingest-customer-ff06ae4d
  uid: dba8b4c1-1554-11ea-ac6b-12ff56d05229
spec:
  affinity: {}
  containers:
  - args:
    - scripts/fetch_and_run.sh
    env:
    - name: COMPANY
      value: acme
    - name: ENVIRONMENT
      value: production
    - name: ELASTIC_BUCKET
      value: customer
    - name: ELASTICSEARCH_HOST
      value: <redacted>
    - name: PATH_TO_EXEC
      value: tools/storage/store_elastic.py
    - name: PYTHONWARNINGS
      value: ignore:Unverified HTTPS request
    - name: PATH_TO_REQUIREMENTS
      value: tools/requirements.txt
    - name: GIT_REPO_URL
      value: <redacted>
    - name: GIT_COMMIT
      value: <redacted>
    - name: SPARK
      value: "true"
    image: dkr.ecr.us-east-1.amazonaws.com/spark-runner:dev
    imagePullPolicy: IfNotPresent
    name: base
    resources:
      limits:
        memory: 28Gi
      requests:
        memory: 28Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /mnt/ssd
      name: tmp-disk
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-cgpcc
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir: {}
    name: tmp-disk
  - name: default-token-cgpcc
    secret:
      defaultMode: 420
      secretName: default-token-cgpcc
status:
  conditions:
  - lastProbeTime: "2019-12-02T22:41:19Z"
    lastTransitionTime: "2019-12-02T22:41:19Z"
    message: '0/9 nodes are available: 9 Insufficient memory.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable