Question

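I installed TensorFlow for Python on my Ubuntu 16.04.3 LTS Xenial / nVidia GTX 1080 Ti machine. Later, the nVidia driver was updated from 375 to 384.90 (nvidia-smi reports NVIDIA-SMI 384.90).

Since then, I can run my programs only as root or in CPU mode. For example, when run under a regular user account, the MNIST example fails with a CUDNN_STATUS_INTERNAL_ERROR: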

E tensorflow/stream_executor/cuda/cuda_dnn.cc:371] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E tensorflow/stream_executor/cuda/cuda_dnn.cc:338] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
F tensorflow/core/kernels/conv_ops.cc:672] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms)

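I have reinstalled the driver / CUDA / cuDNN several times in various combinations, each time following the whole official installation guide for TF r1.3.

I have tried every solution I could find online, but none helped. Most of them suggest this is a memory problem, which is unlikely for a 10 GB card running MNIST.

Any idea how to fix this?

Details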

The update, as logged in /var/log/apt/history.log:

Start-Date: 2017-10-25  06:54:42
Commandline: /usr/bin/unattended-upgrade
Install: nvidia-384-dev:amd64 (384.90-0ubuntu0.16.04.1, automatic), libcuda1-384:amd64 (384.90-0ubuntu0.16.04.1, automatic), nvidia-opencl-icd-384:amd64 (384.90-0ubuntu0.16.04.1, automatic), nvidia-384:amd64 (384.90-0ubuntu0.16.04.1, automatic)
Upgrade: libcurl3:amd64 (7.47.0-1ubuntu2.3, 7.47.0-1ubuntu2.4), libcuda1-375:amd64 (375.66-0ubuntu0.16.04.1, 384.90-0ubuntu0.16.04.1), libicu55:amd64 (55.1-7ubuntu0.2, 55.1-7ubuntu0.3), chromium-browser:amd64 (61.0.3163.100-0ubuntu0.16.04.1306, 62.0.3202.62-0ubuntu0.16.04.1308), chromium-codecs-ffmpeg-extra:amd64 (61.0.3163.100-0ubuntu0.16.04.1306, 62.0.3202.62-0ubuntu0.16.04.1308), nvidia-375-dev:amd64 (375.66-0ubuntu0.16.04.1, 384.90-0ubuntu0.16.04.1), libwebkit2gtk-4.0-37:amd64 (2.16.6-0ubuntu0.16.04.1, 2.18.0-0ubuntu0.16.04.2), mysql-common:amd64 (5.7.19-0ubuntu0.16.04.1, 5.7.20-0ubuntu0.16.04.1), libmysqlclient20:amd64 (5.7.19-0ubuntu0.16.04.1, 5.7.20-0ubuntu0.16.04.1), libicu-dev:amd64 (55.1-7ubuntu0.2, 55.1-7ubuntu0.3), icu-devtools:amd64 (55.1-7ubuntu0.2, 55.1-7ubuntu0.3), chromium-browser-l10n:amd64 (61.0.3163.100-0ubuntu0.16.04.1306, 62.0.3202.62-0ubuntu0.16.04.1308), curl:amd64 (7.47.0-1ubuntu2.3, 7.47.0-1ubuntu2.4), libjavascriptcoregtk-4.0-18:amd64 (2.16.6-0ubuntu0.16.04.1, 2.18.0-0ubuntu0.16.04.2), nvidia-opencl-icd-375:amd64 (375.66-0ubuntu0.16.04.1, 384.90-0ubuntu0.16.04.1), libcurl3-gnutls:amd64 (7.47.0-1ubuntu2.3, 7.47.0-1ubuntu2.4), nvidia-375:amd64 (375.66-0ubuntu0.16.04.1, 384.90-0ubuntu0.16.04.1), libwebkit2gtk-4.0-37-gtk2:amd64 (2.16.6-0ubuntu0.16.04.1, 2.18.0-0ubuntu0.16.04.2)
End-Date: 2017-10-25  06:56:00
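
The driver packages are easy to miss among the other upgrades in that entry. As a rough sketch, the Upgrade: line can be filtered down to the nvidia and cuda packages with a few lines of Python. The helper names parse_upgrades and nvidia_upgrades are made up for this illustration, and the regular expression only assumes the "name:arch (old, new)" layout seen in the log above.

```python
import re

# One "name:arch (old, new)" entry as it appears on an apt history "Upgrade:" line.
ENTRY = re.compile(r"([a-z0-9][a-z0-9+.-]*):([a-z0-9-]+) \(([^,()]+), ([^()]+)\)")


def parse_upgrades(line):
    """Map package name -> (old_version, new_version) for one 'Upgrade:' line."""
    if not line.startswith("Upgrade:"):
        return {}
    return {m.group(1): (m.group(3), m.group(4)) for m in ENTRY.finditer(line)}


def nvidia_upgrades(line):
    """Keep only the packages whose name mentions nvidia or cuda."""
    return {name: versions
            for name, versions in parse_upgrades(line).items()
            if "nvidia" in name or "cuda" in name}


# Shortened sample taken from the log entry above.
sample = ("Upgrade: libcurl3:amd64 (7.47.0-1ubuntu2.3, 7.47.0-1ubuntu2.4), "
          "nvidia-375:amd64 (375.66-0ubuntu0.16.04.1, 384.90-0ubuntu0.16.04.1)")

for name, (old, new) in sorted(nvidia_upgrades(sample).items()):
    print(name, old, "->", new)
```

Run against the full line, this should make it clear that the 375 packages were replaced by 384.90 during the unattended upgrade.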

As a regular user, I can again run the simple validation program from the TensorFlow installation docs:

import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))

But the MNIST example still fails:

(venv-test)$~/tensorflow-validate/models/official/mnist$ python mnist.py
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_keep_checkpoint_every_n_hours': 10000, '_tf_random_seed': 1, '_keep_checkpoint_max': 5, '_session_config': None, '_model_dir': '/tmp/mnist_model', '_save_summary_steps': 100, '_log_step_count_steps': 100, '_save_checkpoints_secs': 600, '_save_checkpoints_steps': None}
INFO:tensorflow:Create CheckpointSaverHook.
2017-10-31 18:39:05.951324: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-31 18:39:05.951342: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-31 18:39:05.951346: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-31 18:39:05.951348: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-31 18:39:05.951366: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-10-31 18:39:06.591310: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-10-31 18:39:06.591682: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:01:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB
2017-10-31 18:39:06.591693: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2017-10-31 18:39:06.591696: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y
2017-10-31 18:39:06.591701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0)
2017-10-31 18:39:07.977441: E tensorflow/stream_executor/cuda/cuda_dnn.cc:371] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2017-10-31 18:39:07.977466: E tensorflow/stream_executor/cuda/cuda_dnn.cc:338] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-10-31 18:39:07.977472: F tensorflow/core/kernels/conv_ops.cc:672] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms)

I tried reinstalling as follows:

Installed CUDA 8

$ sudo apt install cuda-8-0
Reading package lists... Done
Building dependency tree
Reading state information... Done
cuda-8-0 is already the newest version (8.0.61-1).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.

and

$ echo $CUDA_HOME
/usr/local/cuda-8.0
$ echo $LD_LIBRARY_PATH
/usr/local/cuda-8.0/lib64
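
Since the program behaves differently for root and for a regular user, it may be worth confirming that both accounts see the same CUDA paths. The following is a minimal sketch of such a check; the function name cuda_lib_dir_on_path is hypothetical, and the only assumption is the CUDA_HOME/lib64 layout shown above.

```python
import os


def cuda_lib_dir_on_path(env):
    """Return True if $CUDA_HOME/lib64 is one of the LD_LIBRARY_PATH entries."""
    cuda_home = env.get("CUDA_HOME")
    if not cuda_home:
        return False
    wanted = os.path.normpath(os.path.join(cuda_home, "lib64"))
    entries = env.get("LD_LIBRARY_PATH", "").split(os.pathsep)
    # Compare normalized paths so a trailing slash does not matter.
    return any(os.path.normpath(entry) == wanted for entry in entries if entry)


# Check the environment of whichever user runs this script.
print(cuda_lib_dir_on_path(os.environ))
```

Running it once as the regular user and once as root shows whether the environment differs between the two accounts.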

libcupti-dev is installed

$ sudo apt-get install libcupti-dev
Reading package lists... Done
Building dependency tree
Reading state information... Done
libcupti-dev is already the newest version (7.5.18-0ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.

Created a new environment with virtualenv

$ virtualenv --system-site-packages -p python3 ~/venv-test
Already using interpreter /usr/bin/python3
Using base prefix '/usr'
New python executable in /home/represent/venv-test/bin/python3
Also creating executable in /home/represent/venv-test/bin/python
Installing setuptools, pip, wheel...done.

Installed TensorFlow in the virtualenv using pip

(venv-test) represent@gatekeeper:/data/installers$ sudo pip3 install --upgrade tensorflow-gpu
...
Successfully installed bleach-1.5.0 html5lib-0.9999999 markdown-2.6.9 numpy-1.13.3 protobuf-3.4.0 setuptools-36.6.0 six-1.11.0 tensorflow-gpu-1.3.0 tensorflow-tensorboard-0.1.8 wheel-0.30.0

Installed cuDNN 6.0.21

$ sudo dpkg -i libcudnn6_6.0.21-1+cuda8.0_amd64.deb
Selecting previously unselected package libcudnn6.
(Reading database ... 226608 files and directories currently installed.)
Preparing to unpack libcudnn6_6.0.21-1+cuda8.0_amd64.deb ...
Unpacking libcudnn6 (6.0.21-1+cuda8.0) ...
Setting up libcudnn6 (6.0.21-1+cuda8.0) ...
Processing triggers for libc-bin (2.23-0ubuntu9) ...
/sbin/ldconfig.real: /usr/lib/nvidia-384/libEGL.so.1 is not a symbolic link

/sbin/ldconfig.real: /usr/lib32/nvidia-384/libEGL.so.1 is not a symbolic link

and the development package

$ sudo dpkg -i libcudnn6-dev_6.0.21-1+cuda8.0_amd64.deb
Selecting previously unselected package libcudnn6-dev.
(Reading database ... 226614 files and directories currently installed.)
Preparing to unpack libcudnn6-dev_6.0.21-1+cuda8.0_amd64.deb ...
Unpacking libcudnn6-dev (6.0.21-1+cuda8.0) ...
Setting up libcudnn6-dev (6.0.21-1+cuda8.0) ...
update-alternatives: using /usr/include/x86_64-linux-gnu/cudnn_v6.h to provide /usr/include/cudnn.h (libcudnn) in auto mode
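
To confirm which cuDNN build the dynamic loader actually picks up, the library can be queried directly from Python. This is only a sketch: the function name cudnn_version is made up here, and the default soname libcudnn.so.6 is an assumption based on the libcudnn6 package installed above. cudnnGetVersion is the version query exported by the cuDNN C library.

```python
import ctypes


def cudnn_version(libname="libcudnn.so.6"):
    """Return the value of cudnnGetVersion(), or None if the library cannot be loaded."""
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        # The loader could not find or open the library for this user.
        return None
    # cudnnGetVersion returns a size_t, so declare that for ctypes.
    lib.cudnnGetVersion.restype = ctypes.c_size_t
    return lib.cudnnGetVersion()


print(cudnn_version())
```

A result of None for the regular user but a number for root would point at a path or permission difference rather than at the installation itself.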

Verified the installation

(venv-test)$ python
Python 3.5.2 (default, Sep 14 2017, 22:51:06)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.VERSION
'1.3.0'

Answer 1

Together with Jan Benes we looked into this problem and found that the solution is to add our non-root user to the nvidia-persistenced group, for example: sudo usermod -a -G nvidia-persistenced our-nonroot-user.

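The reason is that the default installation of the nvidia driver on Ubuntu (nvidia-384 in our case) creates a user and a group named nvidia-persistenced, and that user is then used to run the NVIDIA Persistence Daemon. If our non-root user cannot access the files written by this daemon, the MNIST example fails. It did not fail for root (which can access everything), and it stopped failing once we added the non-root user to the nvidia-persistenced group.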
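
Whether the fix has taken effect can be checked from Python without TensorFlow. The sketch below lists the groups of the current process and reports access to a given path. Both helper names are hypothetical, and the path /var/run/nvidia-persistenced is an assumption about where the daemon keeps its runtime files; adjust it if your system differs.

```python
import grp
import os


def current_group_names():
    """Names of the groups the current process belongs to."""
    gids = set(os.getgroups()) | {os.getegid()}
    names = set()
    for gid in gids:
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            # This gid has no entry in the group database; skip it.
            pass
    return names


def access_report(paths):
    """Map each path to (exists, readable, writable) for the current user."""
    return {path: (os.path.exists(path),
                   os.access(path, os.R_OK),
                   os.access(path, os.W_OK))
            for path in paths}


print("nvidia-persistenced" in current_group_names())
# Assumed location of the daemon's runtime files, not confirmed by the answer.
print(access_report(["/var/run/nvidia-persistenced"]))
```

Note that a group added with usermod normally applies only to new login sessions, so the first line may keep printing False until the user logs out and back in.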

TensorFlow runs only under root after driver update
