Why did this Kubernetes pod not trigger our autoscaler to scale up?

Time: 2019-12-02 22:48:18

Tags: kubernetes airflow

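We are running a Kubernetes cluster with an autoscaler which, as far as I can tell, works correctly most of the time. When we raise the replica count of a deployment beyond what the cluster's resources can hold, the autoscaler catches it and scales up. Likewise, the cluster scales down when we need fewer resources.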

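That held until today, when some of the pods in our Airflow deployment stopped working because they could not get the resources they need. Instead of triggering a cluster scale-up, the pods fail immediately or are evicted for trying to request or use more resources than are available. See the YAML output of one of the failing pods below. The pods also never show up as Pending: they jump straight from launch to the failed state.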

Am I missing something, such as some kind of retry tolerance, that would keep the pod in Pending so that it waits for a scale-up?

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: eks.privileged
  creationTimestamp: "2019-12-02T22:41:19Z"
  name: ingest-customer-ff06ae4d
  namespace: airflow
  resourceVersion: "32545690"
  selfLink: /api/v1/namespaces/airflow/pods/ingest-customer-ff06ae4d
  uid: dba8b4c1-1554-11ea-ac6b-12ff56d05229
spec:
  affinity: {}
  containers:
  - args:
    - scripts/fetch_and_run.sh
    env:
    - name: COMPANY
      value: acme
    - name: ENVIRONMENT
      value: production
    - name: ELASTIC_BUCKET
      value: customer
    - name: ELASTICSEARCH_HOST
      value: <redacted>
    - name: PATH_TO_EXEC
      value: tools/storage/store_elastic.py
    - name: PYTHONWARNINGS
      value: ignore:Unverified HTTPS request
    - name: PATH_TO_REQUIREMENTS
      value: tools/requirements.txt
    - name: GIT_REPO_URL
      value: <redacted>
    - name: GIT_COMMIT
      value: <redacted>
    - name: SPARK
      value: "true"
    image: dkr.ecr.us-east-1.amazonaws.com/spark-runner:dev
    imagePullPolicy: IfNotPresent
    name: base
    resources:
      limits:
        memory: 28Gi
      requests:
        memory: 28Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /mnt/ssd
      name: tmp-disk
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-cgpcc
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir: {}
    name: tmp-disk
  - name: default-token-cgpcc
    secret:
      defaultMode: 420
      secretName: default-token-cgpcc
status:
  conditions:
  - lastProbeTime: "2019-12-02T22:41:19Z"
    lastTransitionTime: "2019-12-02T22:41:19Z"
    message: '0/9 nodes are available: 9 Insufficient memory.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable
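
One way to sanity-check the `Insufficient memory` message above is to compare the pod's 28Gi memory request with the memory each node type can allocate. The following is only a minimal sketch: the node names and allocatable values are hypothetical placeholders (real values would come from something like `kubectl describe node`), and the comparison ignores memory already taken by other pods on the node. It does not diagnose the autoscaler itself; it only shows whether a request of this size could fit on a given node type at all.

```python
from __future__ import annotations

# Binary and decimal suffixes used by Kubernetes memory quantities.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '28Gi' to bytes; a bare integer is bytes."""
    # Check two-letter suffixes first so 'Gi' is not mistaken for 'G'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * _SUFFIXES[suffix]
    return int(quantity)


def nodes_that_fit(request: str, allocatable: dict[str, str]) -> list[str]:
    """Return the node names whose allocatable memory covers the request."""
    need = parse_memory(request)
    return [name for name, cap in allocatable.items()
            if parse_memory(cap) >= need]


# Hypothetical allocatable memory per node type (assumption, not from the cluster).
example_nodes = {"small-node": "15Gi", "large-node": "30Gi"}

print(nodes_that_fit("28Gi", example_nodes))
```

If the list comes back empty for every node type the cluster can launch, adding more nodes of those types would not make the pod schedulable.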

0 answers:

No answers yet.