Question

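I am trying to deploy a project on GCP using a Kubernetes cluster. I followed the steps at https://cloud.google.com/kubernetes-engine/docs/how-to/gpus to install the drivers on a node with 2 GPUs, and that worked. Here is the nvidia-smi output from inside a container on that node: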

(venv) root@frameprocessor:/opt/visualcortex/bin# nvidia-smi
Fri Feb 15 05:09:36 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48                 Driver Version: 390.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P4            Off  | 00000000:00:05.0 Off |                    0 |
| N/A   30C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |

The program running in the container (which uses Darknet, YOLO, and TensorFlow on the GPU) fails with the following error:

root@frameprocessor:/opt/visualcortex# source ~/miniconda/bin/activate venv && python /opt/visualcortex/bin/run_vision.py
2019-02-15 06:11:40.692718: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-02-15 06:11:40.907127: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-02-15 06:11:40.908274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla P4 major: 6 minor: 1 memoryClockRate(GHz): 1.1135
pciBusID: 0000:00:04.0
totalMemory: 7.43GiB freeMemory: 7.31GiB
2019-02-15 06:11:40.909382: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-15 06:11:41.328257: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-15 06:11:41.328940: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-02-15 06:11:41.329272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-02-15 06:11:41.329867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7053 MB memory) -> physical GPU (device: 0, name: Tesla P4, pci bus id: 0000:00:04.0, compute capability: 6.1)
CUDA Error: invalid device ordinal
python: ./src/cuda.c:36: check_error: Assertion `0' failed.
Aborted (core dumped)

The drivers are installed correctly, so why can't the program find the GPUs? Can you help me solve this?

Part of the code:

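import os
# Enumerate GPUs in PCI bus order so the indices match nvidia-smi.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
# Expose GPUs 0 and 1 to CUDA in this process.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"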
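
For reference, "invalid device ordinal" is the CUDA error reported when code selects a device index outside the range visible to the process, and the TensorFlow log above lists only device 0 as visible. The following is a minimal diagnostic sketch, not part of the original program, for checking whether a requested index can be valid given CUDA_VISIBLE_DEVICES. The helper names (visible_device_ids, check_ordinal) and the physical_count parameter are hypothetical, and the sketch assumes integer indices only (UUID entries are not handled).

```python
import os


def visible_device_ids(env=None):
    """Return the entries of CUDA_VISIBLE_DEVICES as a list, or None if unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


def check_ordinal(ordinal, physical_count, env=None):
    """Return True if `ordinal` would be a valid CUDA device index.

    physical_count is the number of GPUs actually attached to the
    container (assumption: the caller obtains it, e.g. from nvidia-smi).
    """
    ids = visible_device_ids(env)
    if ids is None:
        # Variable unset: every attached GPU is visible.
        return 0 <= ordinal < physical_count
    usable = 0
    for tok in ids:
        # CUDA exposes only the entries that precede the first invalid one.
        # Simplification: non-integer entries (UUIDs) are treated as invalid.
        if tok.isdigit() and int(tok) < physical_count:
            usable += 1
        else:
            break
    return 0 <= ordinal < usable
```

For example, check_ordinal(1, 1, {"CUDA_VISIBLE_DEVICES": "0,1"}) returns False: if only one GPU is actually exposed to the process, index 1 does not exist even though the variable lists it.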

Invalid device ordinal, python: ./src/cuda.c:36: check_error: Assertion `0' failed

0 answers: