TensorFlow: GPU detected inconsistently

Time: 2017-03-25 00:51:12

Tags: tensorflow

I checked whether my TensorFlow installation is using my GPU with the sample code from the TensorFlow instructions here.

When I first ran the code, I got this output:

$ python gpu-test.py

Out:

I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: GRID K520, pci bus id: 0000:00:03.0
I tensorflow/core/common_runtime/direct_session.cc:255] Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: GRID K520, pci bus id: 0000:00:03.0

MatMul: (MatMul): /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] MatMul: (MatMul)/job:localhost/replica:0/task:0/gpu:0
b: (Const): /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] b: (Const)/job:localhost/replica:0/task:0/gpu:0
a: (Const): /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] a: (Const)/job:localhost/replica:0/task:0/gpu:0
[[ 22.  28.]
 [ 49.  64.]]

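It is using the GPU, so all is well!

Reassured, I launched a Jupyter notebook with a large CNN and trained it, and it was very slow.

Confused, I ran gpu-test.py a second time. This time, even though nothing had changed in the meantime, I got different output: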

I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:509] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: ip-172-31-19-90
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: ip-172-31-19-90
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 375.39.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:363] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module  367.57  Mon Oct  3 20:37:01 PDT 2016
GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) 
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 367.57.0
E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:303] kernel version 367.57.0 does not match DSO version 375.39.0 -- cannot find working devices in this configuration
Device mapping: no known devices.
I tensorflow/core/common_runtime/direct_session.cc:255] Device mapping:

MatMul: (MatMul): /job:localhost/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] MatMul: (MatMul)/job:localhost/replica:0/task:0/cpu:0
b: (Const): /job:localhost/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] b: (Const)/job:localhost/replica:0/task:0/cpu:0
a: (Const): /job:localhost/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] a: (Const)/job:localhost/replica:0/task:0/cpu:0
[[ 22.  28.]
 [ 49.  64.]]

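Now I am really confused.

The only two things I did between the first and second runs of the GPU test were: (1) I unzipped a file; (2) I ran said Jupyter notebook. Nothing was installed, updated, or otherwise changed on my system.

Can anyone help?

How can this suddenly happen when it was not the case 5 minutes earlier: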

kernel version 367.57.0 does not match DSO version 375.39.0

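How do I update the kernel version?

2 answers:

Answer 0: (score: 2)

I found out what happened: an automatic driver update, running in the background as part of unattended upgrades, tried to update the driver to version 375.39.0.

However, the GRID K520 GPU on AWS g2.2xlarge instances is too old for this driver version.

The attempted automatic update left the system in an inconsistent state and broke everything.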
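To confirm this kind of mismatch, the kernel module version can be compared with the version of the user-space library (the "DSO version" in the log). The following is only a minimal sketch: the function names are made up for illustration, the sample line is copied from the log in the question, and the DSO version is assumed to be read off the TensorFlow log by hand rather than queried from the driver.

```python
import re

# Path whose contents TensorFlow prints as "driver version file contents".
PROC_VERSION_FILE = "/proc/driver/nvidia/version"

# Sample line copied from the log in the question.
SAMPLE = "NVRM version: NVIDIA UNIX x86_64 Kernel Module  367.57  Mon Oct  3 20:37:01 PDT 2016"


def kernel_module_version(proc_text):
    """Extract the NVIDIA kernel module version from the version file text."""
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", proc_text)
    return match.group(1) if match else None


def read_kernel_module_version(path=PROC_VERSION_FILE):
    """Read the version file on a machine with the NVIDIA driver loaded."""
    with open(path) as handle:
        return kernel_module_version(handle.read())


def versions_match(kernel_version, dso_version):
    """Compare versions numerically, ignoring trailing zeros (367.57 vs 367.57.0)."""
    def normalize(version):
        parts = [int(piece) for piece in version.split(".")]
        while parts and parts[-1] == 0:
            parts.pop()
        return parts
    return normalize(kernel_version) == normalize(dso_version)


if __name__ == "__main__":
    kernel = kernel_module_version(SAMPLE)
    # "375.39.0" is the libcuda version reported in the failing run.
    print(kernel, versions_match(kernel, "375.39.0"))
```

On the affected instance, `read_kernel_module_version()` would take the place of the sample line. If the two versions differ, the user-space libraries were replaced while the older kernel module was still loaded, which is the state the error message describes.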

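The only way out for me was to launch a fresh AWS instance and kill the update process immediately after launch to keep the system intact. A very nasty issue :/.

In case anyone happens to run into the same problem: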

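  • Launch a brand-new AWS g2 instance
  • SSH into it immediately
  • Run top in the terminal to display the running processes
  • Check whether there is a busy process named "unattended..."; if so, note its PID (process ID)
  • Run kill -9 PID to kill it before it tries to install the update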
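The process lookup in the steps above can also be scripted instead of reading top by eye. This is a rough sketch, not a tested recipe: `find_pids` and `running_processes` are hypothetical helper names, the sample process table is invented for illustration, and the script only lists candidate PIDs so that the kill itself stays a deliberate manual step.

```python
import subprocess

# Invented sample of `ps -eo pid,args` output, for illustration only.
SAMPLE_PS = """\
  PID COMMAND
    1 /sbin/init
 1234 /usr/bin/python3 /usr/bin/unattended-upgrade
 1300 sshd: ubuntu@pts/0
"""


def find_pids(ps_output, needle="unattended"):
    """Return the PIDs of processes whose command line contains `needle`."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.strip().split(None, 1)
        # Skip the header and blank lines; keep rows that start with a numeric PID.
        if len(fields) == 2 and fields[0].isdigit() and needle in fields[1]:
            pids.append(int(fields[0]))
    return pids


def running_processes():
    """Return the current process table as text (Linux `ps` from procps)."""
    return subprocess.check_output(["ps", "-eo", "pid,args"], universal_newlines=True)


if __name__ == "__main__":
    # On the instance, replace SAMPLE_PS with running_processes().
    for pid in find_pids(SAMPLE_PS):
        print("candidate PID:", pid)
```

Each PID printed by `find_pids(running_processes())` is a candidate for the `kill -9 PID` step.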

Answer 1: (score: 1)

This means you need to update your CUDA driver to the latest version. Not sure where the inconsistency comes from.