Docker with GPU support enabled (> 19.03) fails to load tensorflow

Date: 2020-07-01 14:40:07

Tags: docker tensorflow

I want to use docker 19.03 or later in order to get GPU support. I currently have docker 19.03.12 on my system. I can run the following to check that the Nvidia drivers are working:

docker run -it --rm --gpus all ubuntu nvidia-smi
Wed Jul  1 14:25:55 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 107...  Off  | 00000000:01:00.0 Off |                  N/A |
| 26%   54C    P5    13W / 180W |    734MiB /  8119MiB |     39%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Also, my module can use GPU support when run locally. However, if I build a docker image and try to run it, I get the message:

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
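
A quick way to check whether the driver library is injected into the container at all is to list the linker cache inside it. This is just a diagnostic sketch: my_tf_image is a placeholder for the failing image, and it assumes bash is available in it.

# check whether libcuda.so.1 is visible inside the container
# (my_tf_image is a placeholder; --entrypoint overrides the image's default command)
docker run --rm --gpus all --entrypoint /bin/bash my_tf_image -c "ldconfig -p | grep libcuda"

If this prints nothing, the CUDA driver library was not mounted into the container.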

I am using cuda 9.0 with tensorflow 1.12.0, but I could switch to cuda 10.0 with tensorflow 1.15.
As far as I can tell, the problem is that I am probably using an older version of the dockerfile, whose commands are not compatible with the new GPU-enabled docker versions (19.03 and later).
The actual commands are the following:

FROM nvidia/cuda:9.0-base-ubuntu16.04

# Pick up some TF dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        cuda-command-line-tools-9-0 \
        cuda-cublas-9-0 \
        cuda-cufft-9-0 \
        cuda-curand-9-0 \
        cuda-cusolver-9-0 \
        cuda-cusparse-9-0 \
        libcudnn7=7.0.5.15-1+cuda9.0 \
        libnccl2=2.2.13-1+cuda9.0 \
        libfreetype6-dev \
        libhdf5-serial-dev \
        libpng12-dev \
        libzmq3-dev \
        pkg-config \
        software-properties-common \
        unzip \
        && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

RUN apt-get update && \
        apt-get install nvinfer-runtime-trt-repo-ubuntu1604-4.0.1-ga-cuda9.0 && \
        apt-get update && \
        apt-get install libnvinfer4=4.1.2-1+cuda9.0
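
For reference, before docker 19.03 an image like this was typically run through the nvidia-docker2 wrapper with --runtime=nvidia; with 19.03+ the GPU is requested directly with the --gpus flag. A sketch of the two invocations, with my_tf_image as a placeholder tag:

# before docker 19.03 (nvidia-docker2): docker run --runtime=nvidia my_tf_image
# docker 19.03 and later:
docker run -it --rm --gpus all my_tf_image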

I also could not find a base dockerfile for basic GPU usage.

In this answer there is a suggestion to expose libcuda.so.1, but it did not work in my case.

So, is there a fix for this, or a base dockerfile I can adapt?

My system is Ubuntu 16.04.

Edit:

I just noticed that nvidia-smi inside docker does not show any cuda version:

CUDA Version: N/A

in contrast to when it is run locally. So, this probably means that for some reason cuda is not loaded inside docker.
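
One thing that may be worth checking here (an assumption, not something verified in this question): as far as I can tell, nvidia-smi reports the CUDA version via the CUDA driver library, so "N/A" is consistent with libcuda.so.1 simply not being mounted when the image does not request the compute driver capability. Explicitly requesting it should make the version appear:

# request the compute capability explicitly so that libcuda.so.1 gets mounted
docker run --rm --gpus all -e NVIDIA_DRIVER_CAPABILITIES=compute,utility ubuntu nvidia-smi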

1 Answer:

Answer 0 (score: 0)

tldr;

A base Dockerfile that seems to work with docker 19.03+ and cuda 10 is this one:

FROM nvidia/cuda:10.0-base

This can be combined with tf 1.14, but for some reason tf 1.15 could not be found.

I just tested it with this Dockerfile:

FROM nvidia/cuda:10.0-base
CMD nvidia-smi

Longer answer:

Well, after a lot of trial and error (and frustration), I managed to get it working for docker 19.03.12 + cuda 10 (although with tf 1.14 instead of 1.15).

I used the code from this post and the base Dockerfile provided there.

First, I tried checking nvidia-smi from inside docker with this Dockerfile:

FROM nvidia/cuda:10.0-base
CMD nvidia-smi

$docker build -t gpu_test .
...
$docker run -it --gpus all gpu_test
Fri Jul  3 07:31:05 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 107...  Off  | 00000000:01:00.0 Off |                  N/A |
| 45%   65C    P2   142W / 180W |   8051MiB /  8119MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

It finally seems to find the cuda binaries: CUDA Version: 10.1
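
A plausible reason why the choice of base image matters (hedged, I have not traced it through the runtime sources): the nvidia/cuda images declare the NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES environment variables, which the nvidia container runtime reads to decide which driver libraries (including libcuda.so.1) to mount into the container. You can see them with docker inspect:

# show the environment variables baked into the base image
docker inspect --format '{{.Config.Env}}' nvidia/cuda:10.0-base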

Then, I made a minimal Dockerfile with which I could test that the tensorflow binary libraries load successfully inside docker:

FROM nvidia/cuda:10.0-base

# The following just declare variables that ultimately select python3 and pip3
ARG USE_PYTHON_3_NOT_2=True
ARG _PY_SUFFIX=${USE_PYTHON_3_NOT_2:+3}
ARG PYTHON=python${_PY_SUFFIX}
ARG PIP=pip${_PY_SUFFIX}

RUN apt-get update && apt-get install -y \
    ${PYTHON} \
    ${PYTHON}-pip

RUN ${PIP} install tensorflow_gpu==1.14.0

COPY bashrc /etc/bash.bashrc
RUN chmod a+rwx /etc/bash.bashrc

WORKDIR /src
COPY *.py /src/

ENTRYPOINT ["python3", "tf_minimal.py"]

and tf_minimal.py is simply:

import tensorflow as tf

print(tf.__version__)
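
Assuming the Dockerfile, tf_minimal.py and the bashrc file sit in the same directory, building and running the image looks like this (tf_gpu_minimal is just a placeholder tag); the last command additionally asks the TF 1.x API itself whether it can see the GPU, by overriding the entrypoint:

docker build -t tf_gpu_minimal .
docker run -it --rm --gpus all tf_gpu_minimal
# optional extra check with the TF 1.x API:
docker run --rm --gpus all --entrypoint python3 tf_gpu_minimal -c "import tensorflow as tf; print(tf.test.is_gpu_available())"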

For completeness, I am also posting the bashrc file I am using:

# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# ==============================================================================

export PS1="\[\e[31m\]tf-docker\[\e[m\] \[\e[33m\]\w\[\e[m\] > "
export TERM=xterm-256color
alias grep="grep --color=auto"
alias ls="ls --color=auto"

echo -e "\e[1;31m"
cat<<TF
________                               _______________                
___  __/__________________________________  ____/__  /________      __
__  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /
_  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ / 
/_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/

TF
echo -e "\e[0;33m"

if [[ $EUID -eq 0 ]]; then
  cat <<WARN
WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.

To avoid this, run the container by specifying your user's userid:

$ docker run -u \$(id -u):\$(id -g) args...
WARN
else
  cat <<EXPL
You are running this container as user with ID $(id -u) and group $(id -g),
which should map to the ID and group for your user on the Docker host. Great!
EXPL
fi

# Turn off colors
echo -e "\e[m"