I am trying to set up tensorflow-gpu on Google Cloud.
This is the available GPU:
arnoldwright1@gpu:~$ ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:04.0 ==
modalias : pci:v000010DEd000015F8sv000010DEsd0000118Fbc03sc02i00
vendor : NVIDIA Corporation
model : GP100GL [Tesla P100 PCIe 16GB]
driver : nvidia-driver-390 - distro non-free recommended
driver : xserver-xorg-video-nouveau - distro free builtin
I have installed the NVIDIA driver, which nvidia-smi confirms here:
arnoldwright1@gpu:~$ nvidia-smi
Tue Feb 26 12:44:54 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:00:04.0 Off | 0 |
| N/A 38C P0 26W / 250W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Checking the CUDA toolkit version with nvcc gives 9.1:
arnoldwright1@gpu:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85
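Next, I followed this tutorial to install the cuDNN package cudnn-9.1-linux-x64-v7.1.tgz, but with a newer version of the driver.
At the last step, confirming that tensorflow-gpu is set up correctly, I get this error: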
In [1]: from tensorflow.python.client import device_lib
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-1-48501209a23d> in <module>()
----> 1 from tensorflow.python.client import device_lib
/home/arnoldwright1/.local/lib/python2.7/site-packages/tensorflow/__init__.py in <module>()
22
23 # pylint: disable=g-bad-import-order
---> 24 from tensorflow.python import pywrap_tensorflow # pylint: disable=unused-import
25
26 try:
/home/arnoldwright1/.local/lib/python2.7/site-packages/tensorflow/python/__init__.py in <module>()
47 import numpy as np
48
---> 49 from tensorflow.python import pywrap_tensorflow
50
51 from tensorflow.python.tools import component_api_helper
/home/arnoldwright1/.local/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow.py in <module>()
72 for some common reasons and solutions. Include the entire stack trace
73 above this error message when asking for help.""" % traceback.format_exc()
---> 74 raise ImportError(msg)
75
76 # pylint: enable=wildcard-import,g-import-not-at-top,unused-import,line-too-long
ImportError: Traceback (most recent call last):
File "/home/arnoldwright1/.local/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
from tensorflow.python.pywrap_tensorflow_internal import *
File "/home/arnoldwright1/.local/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
_pywrap_tensorflow_internal = swig_import_helper()
File "/home/arnoldwright1/.local/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory
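I have looked into this error, and it seems to come from a misconfigured location for the CUDA or cuDNN libraries, but I can't figure out where the problem originates.
I also edited my bash file so that it includes the following lines at the start: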
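For what it's worth, the missing file name in the traceback already carries a version number, so one quick sanity check is to compare it against what nvcc reports. Below is a minimal sketch of that comparison; the helper names `cuda_version_from_soname` and `cuda_version_from_nvcc` are made up for illustration, and the two input strings are copied from the output pasted above.

```python
import re


def cuda_version_from_soname(error_text):
    """Return the X.Y suffix of the first lib*.so.X.Y name in the text, or None."""
    match = re.search(r"lib\w+\.so\.(\d+\.\d+)", error_text)
    return match.group(1) if match else None


def cuda_version_from_nvcc(nvcc_output):
    """Return the X.Y after 'release' in `nvcc --version` output, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


# Both strings are taken verbatim from the terminal output in this post.
error_text = ("ImportError: libcublas.so.9.0: cannot open shared object file: "
              "No such file or directory")
nvcc_output = "Cuda compilation tools, release 9.1, V9.1.85"

wanted = cuda_version_from_soname(error_text)
installed = cuda_version_from_nvcc(nvcc_output)

print("library version TensorFlow tried to load: %s" % wanted)
print("toolkit version reported by nvcc: %s" % installed)
print("versions match: %s" % (wanted == installed))
```

With the values above, the two versions differ (9.0 versus 9.1), which would suggest the installed TensorFlow wheel was built against a different CUDA release than the toolkit on this machine, rather than the library simply being in the wrong directory.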
export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
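To check whether these exports actually point the loader at the missing file, something like the following could be run in the same shell. It is only a sketch: `find_library` is a hypothetical helper, and it looks at the directories listed in LD_LIBRARY_PATH only, not at the system linker cache. It also counts empty entries, because the second export above has a literal colon before the `${LD_LIBRARY_PATH:+:...}` expansion, which leaves a trailing or doubled colon, and the dynamic loader treats an empty entry as the current directory.

```python
import os


def find_library(ld_library_path, filename):
    """Inspect an LD_LIBRARY_PATH-style string.

    Returns (hits, empty_entries): the listed directories that contain
    `filename`, and the number of empty entries produced by a leading,
    trailing, or doubled colon.
    """
    entries = ld_library_path.split(":") if ld_library_path else []
    empty_entries = sum(1 for entry in entries if entry == "")
    hits = [entry for entry in entries
            if entry and os.path.isfile(os.path.join(entry, filename))]
    return hits, empty_entries


if __name__ == "__main__":
    value = os.environ.get("LD_LIBRARY_PATH", "")
    # The file name comes from the ImportError in this post.
    hits, empty = find_library(value, "libcublas.so.9.0")
    print("LD_LIBRARY_PATH = %r" % value)
    print("directories containing libcublas.so.9.0: %s" % hits)
    print("empty entries: %d" % empty)
```

If the list of directories comes back empty, the file is not in any of the configured locations, so adjusting the paths alone would not be expected to resolve the import error.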