我正在尝试使用kubernetes集群在GCP上部署项目。 我按照https://cloud.google.com/kubernetes-engine/docs/how-to/gpus中的步骤在2xGPU节点中安装了驱动程序,它确实起作用了。查看我在节点中的容器内获得的输出:
(venv) root@frameprocessor:/opt/visualcortex/bin# nvidia-smi
Fri Feb 15 05:09:36 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48 Driver Version: 390.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P4 Off | 00000000:00:04.0 Off | 0 |
| N/A 32C P8 7W / 75W | 0MiB / 7611MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P4 Off | 00000000:00:05.0 Off | 0 |
| N/A 30C P8 7W / 75W | 0MiB / 7611MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
在容器中运行的程序(利用Darknet,yolo和tenserflow使用GPU)引发如下错误:
root@frameprocessor:/opt/visualcortex# source ~/miniconda/bin/activate venv && python /opt/visualcortex/bin/run_vision.py
2019-02-15 06:11:40.692718: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-02-15 06:11:40.907127: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-02-15 06:11:40.908274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla P4 major: 6 minor: 1 memoryClockRate(GHz): 1.1135
pciBusID: 0000:00:04.0
totalMemory: 7.43GiB freeMemory: 7.31GiB
2019-02-15 06:11:40.909382: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-15 06:11:41.328257: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-15 06:11:41.328940: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-15 06:11:41.329272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-15 06:11:41.329867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7053 MB memory) -> physical GPU (device: 0, name: Tesla P4, pci bus id: 0000:00:04.0, compute capability: 6.1)
CUDA Error: invalid device ordinal
python: ./src/cuda.c:36: check_error: Assertion `0' failed.
Aborted (core dumped)
驱动程序安装正确,但是为什么程序找不到它们? 您能帮忙解决这个问题吗?
部分代码:
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"