Question

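I am running a Docker image built on the base image nvidia/cuda:10.2-cudnn7-devel-ubuntu18.04. I installed tensorflow-gpu 1.14.0 with conda, but for some reason the conda install does not use the GPU and does not even detect CUDA.

Details: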

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89

$ nvidia-smi
Fri Nov  6 06:42:31 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:5E:00.0 Off |                    0 |
| N/A   30C    P8    15W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

$ conda list
...
tensorflow                1.14.0               h4531e10_0    conda-forge
tensorflow-base           1.14.0           py37h4531e10_0    conda-forge
tensorflow-estimator      1.14.0           py37h5ca1d4c_0    conda-forge
tensorflow-gpu            1.14.0               h0d30ee6_0    defaults
...
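
The tensorflow rows in this listing come from two different channels (conda-forge and defaults). The following is a minimal sketch for grouping those rows by channel. It assumes the four-column layout shown above (name, version, build, channel), and `packages_by_channel` is a hypothetical helper name, not part of conda.

```python
from collections import defaultdict

# Rows copied from the `conda list` output above
# (columns: name, version, build, channel).
CONDA_LIST = """\
tensorflow                1.14.0               h4531e10_0    conda-forge
tensorflow-base           1.14.0           py37h4531e10_0    conda-forge
tensorflow-estimator      1.14.0           py37h5ca1d4c_0    conda-forge
tensorflow-gpu            1.14.0               h0d30ee6_0    defaults
"""


def packages_by_channel(listing, prefix="tensorflow"):
    """Group package names starting with `prefix` by their channel column."""
    channels = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        # Skip comment lines, "..." elisions, and rows without a channel column.
        if len(fields) < 4 or fields[0].startswith("#"):
            continue
        name, _version, _build, channel = fields[:4]
        if name.startswith(prefix):
            channels[channel].append(name)
    return dict(channels)


if __name__ == "__main__":
    for channel, names in packages_by_channel(CONDA_LIST).items():
        print(f"{channel}: {', '.join(names)}")
```

In a real environment the listing could be captured with `conda list` and passed to the same function. Rows that omit the channel column would be skipped by this sketch.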

Inside the Python interpreter:

import tensorflow as tf
/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])

tf.__version__
'1.14.0'

tf.test.is_built_with_cuda()
False

tf.test.is_gpu_available()
2020-11-06 06:53:29.829226: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2020-11-06 06:53:29.850287: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2500000000 Hz
2020-11-06 06:53:29.858346: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x556b4f1a0560 executing computations on platform Host. Devices:
2020-11-06 06:53:29.858397: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
False

However, installing tensorflow-gpu with pip works fine (shown for demonstration only; my requirements call for conda):

tf.__version__
'1.14.0'

tf.test.is_built_with_cuda()
True

tf.test.is_gpu_available()
2020-11-06 07:01:14.032230: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-11-06 07:01:14.059798: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2500000000 Hz
2020-11-06 07:01:14.067201: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b33b4840f0 executing computations on platform Host. Devices:
2020-11-06 07:01:14.067240: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2020-11-06 07:01:14.068406: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-11-06 07:01:14.089496: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:5e:00.0
2020-11-06 07:01:14.089732: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-11-06 07:01:14.089856: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-11-06 07:01:14.089947: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-11-06 07:01:14.090037: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-11-06 07:01:14.090131: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-11-06 07:01:14.090220: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-11-06 07:01:14.094573: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-11-06 07:01:14.094606: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2020-11-06 07:01:14.219019: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-06 07:01:14.219061: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2020-11-06 07:01:14.219081: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2020-11-06 07:01:14.223660: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b33e64f7c0 executing computations on platform CUDA. Devices:
2020-11-06 07:01:14.223692: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
False
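
In the pip log above, every library that fails to load carries a `.so.10.0` suffix, while `libcuda.so.1` and `libcudnn.so.7` open successfully, and the final `is_gpu_available()` result is still False. The following is a rough sketch for summarizing such a log. It assumes the `dso_loader` message wording shown above, and `summarize_dso_log` is a hypothetical helper name, not a TensorFlow API.

```python
import re

# Four lines copied verbatim from the log above; the remaining
# dso_loader lines follow the same two patterns.
LOG = """\
2020-11-06 07:01:14.068406: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-11-06 07:01:14.089732: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-11-06 07:01:14.089856: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-11-06 07:01:14.094573: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
"""

MISSING = re.compile(r"Could not dlopen library '([^']+)'")
OPENED = re.compile(r"Successfully opened dynamic library (\S+)")


def summarize_dso_log(log):
    """Return (opened, missing) shared-library names found in a TensorFlow log."""
    opened = OPENED.findall(log)
    missing = MISSING.findall(log)
    return opened, missing


if __name__ == "__main__":
    opened, missing = summarize_dso_log(LOG)
    print("opened: ", ", ".join(opened))
    print("missing:", ", ".join(missing))
```

Run against the full log, the missing list would contain the six `.so.10.0` libraries reported above.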

Python environment does not use the GPU when tensorflow-gpu is installed with conda

0 answers: