Invalid device ordinal error on a Linux GPU server?

Time: 2019-06-05 10:32:22

Tags: linux tensorflow deep-learning gpu

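I am trying to run TensorFlow code on GPUs that I access remotely over SSH. I connect from Windows CMD via SSH and get the server's Linux terminal. I want the code to run on the server's GPUs rather than on the CPU, so I installed tensorflow-gpu. I run Python from a conda environment. When I start Python, import tensorflow, and create a session, I get the errors below. How can I fix this?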

>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
2019-06-05 13:26:45.280912: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-06-05 13:26:45.309892: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3399505000 Hz
2019-06-05 13:26:45.311731: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55ffc55f7cc0 executing computations on platform Host. Devices:
2019-06-05 13:26:45.311780: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-06-05 13:26:45.315413: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-06-05 13:26:46.043595: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: 11719409664
2019-06-05 13:26:46.044237: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55ffc84a7710 executing computations on platform CUDA. Devices:
2019-06-05 13:26:46.044297: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-06-05 13:26:46.044308: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (1): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-06-05 13:26:46.044322: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (2): GeForce GTX 1080 Ti, Compute Capability 6.1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/media/data_dump_1/group_9/anaconda3/envs/pradyumnaenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1570, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/media/data_dump_1/group_9/anaconda3/envs/pradyumnaenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 693, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Invalid device ordinal value (3). Valid range is [0, 2].
        while setting up XLA_GPU_JIT device number 3
>>> sess=tf.Session()
2019-06-05 13:27:09.128360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:0a:00.0
2019-06-05 13:27:09.130001: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:0b:00.0
2019-06-05 13:27:09.131300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:41:00.0
2019-06-05 13:27:09.132226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 3 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:42:00.0
2019-06-05 13:27:09.133262: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-06-05 13:27:09.135097: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-06-05 13:27:09.385582: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-06-05 13:27:09.486764: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-06-05 13:27:09.489289: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-06-05 13:27:10.140611: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-06-05 13:27:10.145133: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-06-05 13:27:10.159615: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/media/data_dump_1/group_9/anaconda3/envs/pradyumnaenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1570, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/media/data_dump_1/group_9/anaconda3/envs/pradyumnaenv/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 693, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Invalid device ordinal value (3). Valid range is [0, 2].
        while setting up XLA_GPU_JIT device number 3
>>> exit()

Here are the GPU details:

(pradyumnaenv) cse563@falcon:~$ nvidia-smi
Wed Jun  5 15:53:05 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.54                 Driver Version: 396.54                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:0A:00.0 Off |                  N/A |
|  0%   44C    P8    19W / 250W |  10791MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:0B:00.0 Off |                  N/A |
| 22%   53C    P8    21W / 250W |  10791MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:41:00.0 Off |                  N/A |
| 31%   57C    P8    22W / 250W |    677MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:42:00.0 Off |                  N/A |
|100%   91C    P2   132W / 250W |  11105MiB / 11176MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     78117      C   python                                     10777MiB |
|    1     47128      C   python                                     10777MiB |
|    2     79606      C   ...t_nagpal/miniconda3/envs/dnn/bin/python   667MiB |
|    3     83393      C   python                                     11095MiB |
+-----------------------------------------------------------------------------+

1 answer:

Answer 0 (score: 0)

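The device that fails is GPU 3. In the first log, XLA cannot initialize CUDA device 3 (CUDA_ERROR_OUT_OF_MEMORY), so it registers only devices 0 to 2, and the session then fails when it tries to set up XLA_GPU_JIT device 3. The nvidia-smi output shows why: GPU 3 has 11105 MiB of 11176 MiB in use and is at 100% utilization. Note that GPUs 0 and 1 are also nearly full (10791 MiB of 11178 MiB each); only GPU 2, with 677 MiB in use, has substantial free memory.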

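You can add the following lines so that tensorflow-gpu numbers the devices by PCI bus ID (the same order nvidia-smi uses) and sees only the GPU you choose. Set them before tensorflow is imported, because the variables are read when CUDA is initialized.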

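import os

# Must run before "import tensorflow"; CUDA reads these at initialization.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
# Per the nvidia-smi output above, GPU 2 has the most free memory.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

import tensorflow as tf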
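
On a shared server the free GPU changes over time, so hard-coding an index may stop working later. The following is a minimal sketch, not part of the original answer, that picks the GPU with the most free memory at start-up. The function names (`pick_freest_gpu`, `query_free_memory`, `select_gpu`) are hypothetical, and it assumes `nvidia-smi` is on the PATH and supports the `--query-gpu` option.

```python
import os
import subprocess


def pick_freest_gpu(csv_text):
    """Return the index (as a string) of the GPU with the most free memory.

    csv_text holds one "index, free_MiB" pair per line, as printed by
    nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits
    On a tie, the lowest-listed GPU wins.
    """
    best_index, best_free = None, -1
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, free_mib = (field.strip() for field in line.split(","))
        if int(free_mib) > best_free:
            best_index, best_free = index, int(free_mib)
    if best_index is None:
        raise ValueError("no GPU rows found")
    return best_index


def query_free_memory():
    """Ask nvidia-smi for the free memory (MiB) of every GPU."""
    return subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,memory.free",
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )


def select_gpu():
    """Expose only the freest GPU to CUDA. Call before importing tensorflow."""
    # PCI_BUS_ID ordering makes CUDA indices match the ones nvidia-smi prints.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = pick_freest_gpu(query_free_memory())
```

Call `select_gpu()` first and only then `import tensorflow`. Inside TensorFlow the single visible GPU is then addressed as device 0, whatever its physical index is. Another user's job can still claim the memory between the query and the session start, so this reduces the chance of the error rather than ruling it out.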