Question

I created a small cluster with GPU nodes on GKE, like so:

# create cluster and CPU nodes
gcloud container clusters create clic-cluster \
    --zone us-west1-b \
    --machine-type n1-standard-1 \
    --enable-autoscaling \
    --min-nodes 1 \
    --max-nodes 3 \
    --num-nodes 2

# add GPU nodes
gcloud container node-pools create gpu-pool \
    --zone us-west1-b \
    --machine-type n1-standard-2 \
    --accelerator type=nvidia-tesla-k80,count=1 \
    --cluster clic-cluster \
    --enable-autoscaling \
    --min-nodes 1 \
    --max-nodes 2 \
    --num-nodes 1

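When I submit a GPU job, it successfully ends up running on the GPU node. However, when I submit a second job, I get an UnexpectedAdmissionError from Kubernetes:

Update plugin resources failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected.

I would have expected the cluster to start a second GPU node and place the job there. Any idea why this didn't happen? My job spec looks roughly like this: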

apiVersion: batch/v1
kind: Job
metadata:
  name: <job_name>
spec:
  template:
    spec:
      initContainers:
      - name: decode
        image: "<decoder_image>"
        resources:
          limits:
            nvidia.com/gpu: 1
        command: [...]
      [...]
      containers:
      - name: evaluate
        image: "<evaluation_image>"
        command: [...]

Answer 1

The resource constraint needs to be added to the containers spec as well:

apiVersion: batch/v1
kind: Job
metadata:
  name: <job_name>
spec:
  template:
    spec:
      initContainers:
      - name: decode
        image: "<decoder_image>"
        resources:
          limits:
            nvidia.com/gpu: 1
        command: [...]
      [...]
      containers:
      - name: evaluate
        image: "<evaluation_image>"
        resources:
          limits:
            nvidia.com/gpu: 1
        command: [...]

I only need a GPU in one of the containers, but this seems to confuse the scheduler. Now autoscaling and scheduling work as expected.
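
As a rough way to catch this before submitting, the sketch below lists every container in a Job spec (init or regular) that has no nvidia.com/gpu limit. It only inspects the spec as a plain Python dict and says nothing about what the scheduler or autoscaler actually does. The helper names containers_missing_gpu_limit and make_job are hypothetical, and the dicts simply mirror the placeholder job from the question.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def containers_missing_gpu_limit(job):
    """Return the names of init and regular containers without a GPU limit."""
    pod_spec = job["spec"]["template"]["spec"]
    missing = []
    for key in ("initContainers", "containers"):
        for container in pod_spec.get(key, []):
            limits = container.get("resources", {}).get("limits", {})
            if GPU_RESOURCE not in limits:
                missing.append(container["name"])
    return missing


def make_job(evaluate_resources):
    """Build a dict shaped like the Job from the question (placeholders kept)."""
    evaluate = {"name": "evaluate", "image": "<evaluation_image>"}
    if evaluate_resources is not None:
        evaluate["resources"] = evaluate_resources
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": "<job_name>"},
        "spec": {
            "template": {
                "spec": {
                    "initContainers": [
                        {
                            "name": "decode",
                            "image": "<decoder_image>",
                            "resources": {"limits": {GPU_RESOURCE: 1}},
                        }
                    ],
                    "containers": [evaluate],
                }
            }
        },
    }


# Spec from the question: only the init container declares a GPU limit.
original_job = make_job(None)
# Spec from the answer: the regular container declares the limit too.
fixed_job = make_job({"limits": {GPU_RESOURCE: 1}})

print(containers_missing_gpu_limit(original_job))
print(containers_missing_gpu_limit(fixed_job))
```

With the original spec the check reports the evaluate container; with the limit added to both containers it reports nothing.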

Kubernetes autoscaling of GPU nodes
