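I have a question about GPU resources in Kubeflow. I am running the Job below.
The only place I specify a GPU resource is the first container, which requests a single GPU. However, the event message says: 0/4 nodes are available: 4 Insufficient nvidia.com/gpu.
Why does this Job search 4 nodes when I requested only 1 GPU? Is my interpretation wrong? Thanks in advance.
FYI: I have 3 worker nodes, each with 1 GPU.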
apiVersion: batch/v1
kind: Job
metadata:
  name: saint-train-3
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  template:
    spec:
      initContainers:
      - name: dataloader
        image: <AWS CLI Image>
        command: ["/bin/sh", "-c", "aws s3 cp s3://<Kubeflow Bucket>/kubeflowdata.tar.gz /s3-data; cd /s3-data; tar -xvzf kubeflowdata.tar.gz; cd kubeflow_data; ls"]
        volumeMounts:
        - mountPath: /s3-data
          name: s3-data
        env:
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef: {key: AWS_ACCESS_KEY_ID, name: aws-secret}
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef: {key: AWS_SECRET_ACCESS_KEY, name: aws-secret}
      containers:
      - name: trainer
        image: <Our Model Image>
        command: ["/bin/sh", "-c", "wandb login <ID>; python /opt/ml/src/main.py --base_path='/s3-data/kubeflow_data' --debug_mode='0' --project='kubeflow-test' --name='test2' --gpu=0 --num_epochs=1 --num_workers=4"]
        volumeMounts:
        - mountPath: /s3-data
          name: s3-data
        resources:
          limits:
            nvidia.com/gpu: "1"
      - name: gpu-watcher
        image: pytorch/pytorch:latest
        command: ["/bin/sh", "-c", "--"]
        args: [ "while true; do sleep 30; done;" ]
        volumeMounts:
        - mountPath: /s3-data
          name: s3-data
      volumes:
      - name: s3-data
        persistentVolumeClaim:
          claimName: test-claim
      restartPolicy: OnFailure
  backoffLimit: 6
Answer 0 (score: 0)
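0/4 nodes are available: 4 Insufficient nvidia.com/gpu
This means that none of your nodes currently has an nvidia.com/gpu resource available to allocate. Note that nvidia.com/gpu is an extended resource that each node advertises (typically through the NVIDIA device plugin), not a node label. The "4" is the total number of nodes in the cluster that the scheduler considered, not the number of GPUs requested: all four nodes were rejected for the same reason.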
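One way to check this is to look at what each node advertises as allocatable. The following is a minimal sketch, not a definitive diagnostic: it assumes `kubectl` is installed and configured for the cluster, and the function names `allocatable_gpus` and `fetch_nodes` are hypothetical helpers introduced here for illustration. It reads the node list as JSON and reports the `nvidia.com/gpu` count under `status.allocatable` for each node.

```python
import json
import subprocess

GPU_RESOURCE = "nvidia.com/gpu"


def allocatable_gpus(node_list: dict) -> dict:
    """Map each node name to its allocatable nvidia.com/gpu count.

    A node that does not advertise the resource at all is reported as 0.
    """
    result = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes serializes resource quantities as strings, e.g. "1".
        result[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return result


def fetch_nodes() -> dict:
    """Return the parsed output of `kubectl get nodes -o json`.

    Assumption: kubectl is on PATH and points at the right cluster.
    """
    completed = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)
```

Calling `allocatable_gpus(fetch_nodes())` returns a dictionary keyed by node name. If every node shows 0, the nodes are probably not advertising GPUs at all, which usually points at the device plugin or driver setup. If the workers show 1, the GPUs exist but may already be allocated to other pods; the allocatable value does not subtract GPUs in use, so the "Allocated resources" section of `kubectl describe node <node-name>` is the place to look next.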