当我使用slurm(http://slurm.schedmd.com/)工作负载管理器时出现此错误。当我运行一些tensorflow python脚本时,有时会导致错误(附加)。它似乎无法找到安装的cuda库,但我正在运行不需要GPU的脚本。因此,我发现为什么cuda会成为一个问题非常令人困惑。如果我不需要,为什么cuda安装会出现问题?
我从slurm-job_id文件中获得的唯一有用信息如下:
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:102] Couldn't open CUDA library libcudnn.so. LD_LIBRARY_PATH: /cm/shared/openmind/cuda/7.5/lib64:/cm/shared/openmind/cuda/7.5/lib
I tensorflow/stream_executor/cuda/cuda_dnn.cc:2092] Unable to load cuDNN DSO
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: node047
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: node047
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:347] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.63 Sat Nov 7 21:25:42 PST 2015
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 352.63.0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine.
我一直以为张量流不需要GPU。所以我假设最后一个错误,说没有GPU不会导致错误(如果我错了,请纠正我)。
我不明白为什么我需要CUDA库。我正在尝试使用GPU运行我的工作,如果我的工作是CPU工作,为什么我需要cuda库?
我尝试直接登录节点并启动tensorflow,但我没有明显错误:
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:102] Couldn't open CUDA library libcudnn.so. LD_LIBRARY_PATH: /cm/shared/openmind/cuda/7.5/lib64:/cm/shared/openmind/cuda/7.5/lib
I tensorflow/stream_executor/cuda/cuda_dnn.cc:2092] Unable to load cuDNN DSO
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
虽然我预料到错误:
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:102] Couldn't open CUDA library libcudnn.so. LD_LIBRARY_PATH: /cm/shared/openmind/cuda/7.5/lib64:/cm/shared/openmind/cuda/7.5/lib
I tensorflow/stream_executor/cuda/cuda_dnn.cc:2092] Unable to load cuDNN DSO
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: node047
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: node047
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:347] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.63 Sat Nov 7 21:25:42 PST 2015
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 352.63.0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine.
我还在tensorflow库中发了一个正式的git问题:
答案 0 :(得分:1)
tensorflow中存在一些错误,通过批处理作业提交slurm。
目前我通过在slurm上运行srun来解决它。
在您的情况下,您还可以安装GPU版本的tensorflow并在没有GPU的计算机上运行它。这导致了你的另一个错误。
答案 1 :(得分:0)
我遇到了类似的问题,在将模型写入光泽文件系统时,我设法将其归结为保护程序。仍在等待真正的解决方案。