Question

我的系统安装了Ubuntu20.04，因此为Tensorflow获取CUDA和cudnn的正确组合似乎有些棘手。我尝试了CUDA11，但无法使用cudnn，因此我通过sudo apt install nvidia-cuda-toolkit和相应的cudnn（7.6.5）（some helpful answers）安装了CUDA10.1。现在，当我安装Tensorflow-gpu 2时，我可以轻松地检查它是否使用了GPU：

import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

给出正确的输出2。但是我需要使用Tensorflow-gpu-1.15。因此，我根据答案in this SO post尝试了以下方法：

import tensorflow as tf
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

with tf.Session() as sess:
    print (sess.run(c))

给出以下输出：

2020-07-11 14:05:53.181428: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-11 14:05:53.183404: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:01:00.0
2020-07-11 14:05:53.183598: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-11 14:05:53.185222: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:02:00.0
2020-07-11 14:05:53.185548: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/include:/usr/lib/cuda/lib64:
2020-07-11 14:05:53.185790: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/include:/usr/lib/cuda/lib64:
2020-07-11 14:05:53.186015: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/include:/usr/lib/cuda/lib64:
2020-07-11 14:05:53.186237: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/include:/usr/lib/cuda/lib64:
2020-07-11 14:05:53.186459: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/include:/usr/lib/cuda/lib64:
2020-07-11 14:05:53.186578: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/cuda/include:/usr/lib/cuda/lib64:
2020-07-11 14:05:53.186594: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-11 14:05:53.186601: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-07-11 14:05:53.187652: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-11 14:05:53.187669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      
Traceback (most recent call last):
  File "/home/mo/anaconda3/envs/lf/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
  File "/home/mo/anaconda3/envs/lf/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1348, in _run_fn
self._extend_graph()
  File "/home/mo/anaconda3/envs/lf/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1388, in _extend_graph
    tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation MatMul: {{node MatMul}} was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1 ]. Make sure the device specification refers to a valid device.
 [[MatMul]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/mo/anaconda3/envs/lf/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
  File "/home/mo/anaconda3/envs/lf/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
  File "/home/mo/anaconda3/envs/lf/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
  File "/home/mo/anaconda3/envs/lf/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation MatMul: node MatMul (defined at /home/mo/anaconda3/envs/lf/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748)  was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:1 ]. Make sure the device specification refers to a valid device.
 [[MatMul]]

我无法理解问题，是否需要其他CUDA版本（10.0可能是由于缺少库），还是应该更改tf.device中的设备名称，所以不确定CUDA10 .0可以安装在ubuntu20上，所以可以安装较旧的ubuntu版本吗？

Answer 1

在使CUDA正常工作时，我遇到了类似的问题。我的解决方案是降级到Ubuntu 18.04，并确保我具有gcc，CUDA和tensorflow的正确组合，如测试的构建配置中所列：

https://www.tensorflow.org/install/source#tested_build_configurations

我的解决方案的原始内容记录在以下StackOverflow问题中：

ImportError: libcublas.so.10.0: cannot open shared object file: No such file or director

Answer 2

您对诸如libcublas, libcurand, etc之类的Cuda库有疑问。尝试正确设置LD_LIBRARY_PATH并检查/usr/lib/cuda/include:/usr/lib/cuda/lib64包含哪些目录。

Tensorflow-gpu 1.15不使用GPU

2 个答案: