TensorFlow crashes in mnist-deep.py

Posted: 2017-10-18 08:35:38

Tags: tensorflow keras tensorflow-gpu

I am currently training with mnist-deep.py from the TensorFlow tutorials on a GeForce 1080 (8 GB) in a machine with 16 GB of RAM. All the latest CUDA libraries and drivers are installed, and everything runs on TensorFlow 1.3. The mnist-deep.py script worked fine without any errors until I decided to run some Keras VDSR training (https://github.com/jackie840129/VDSR-reduction_with-Keras). That training hung and the GPU was lost (it could no longer be reached via nvidia-smi). After rebooting, attempting to run mnist-deep.py again keeps producing the error below. It is still not clear to me what caused the problem; rebooting and reinstalling CUDA do not seem to fix it. Re-imaging the machine does seem to resolve it, but that is hardly a practical approach. Any ideas on what could be causing the problem and how to fix it?

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/tensorflow/mnist/input_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/tensorflow/mnist/input_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/tensorflow/mnist/input_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/mnist/input_data/t10k-labels-idx1-ubyte.gz
Saving graph to: /tmp/tmpgb1l75z_
2017-10-18 15:36:28.098787: W 
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow 
library wasn't compiled to use SSE4.1 instructions, but these are 
available on your machine and could speed up CPU computations.
2017-10-18 15:36:28.098807: W 
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow 
library wasn't compiled to use SSE4.2 instructions, but these are 
available on your machine and could speed up CPU computations.
2017-10-18 15:36:28.098814: W 
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow 
library wasn't compiled to use AVX instructions, but these are 
available on your machine and could speed up CPU computations.
2017-10-18 15:36:28.098820: W 
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow 
library wasn't compiled to use AVX2 instructions, but these are 
available on your machine and could speed up CPU computations.
2017-10-18 15:36:28.098825: W 
tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow 
library wasn't compiled to use FMA instructions, but these are 
available on your machine and could speed up CPU computations.
2017-10-18 15:36:28.760202: I 
tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful 
NUMA node read from SysFS had negative value (-1), but there must be 
at least one NUMA node, so returning NUMA node zero
2017-10-18 15:36:28.760643: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 
with properties: 
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7715
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
2017-10-18 15:36:28.760657: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-10-18 15:36:28.760664: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-10-18 15:36:28.760672: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating 
TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci 
bus id: 0000:01:00.0)
2017-10-18 15:36:31.546892: E 
tensorflow/stream_executor/cuda/cuda_driver.cc:1073] failed to get 
elapsed time between events: CUDA_ERROR_NOT_READY
2017-10-18 15:36:32.547035: E 
tensorflow/stream_executor/cuda/cuda_driver.cc:1073] failed to get 
elapsed time between events: CUDA_ERROR_NOT_READY
2017-10-18 15:36:32.549299: E 
tensorflow/stream_executor/cuda/cuda_blas.cc:366] failed to create 
cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2017-10-18 15:36:32.549317: W 
tensorflow/stream_executor/stream.cc:1756] attempting to perform BLAS 
operation using StreamExecutor without BLAS support
Traceback (most recent call last):
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/client/session.py", line 1327, in _do_call
return fn(*args)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/client/session.py", line 1306, in _run_fn
status, run_metadata)
File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
next(self.gen)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/framework/errors_impl.py", line 466, in 
raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM 
launch failed : a.shape=(50, 3136), b.shape=(3136, 1024), m=50, 
n=1024, k=3136
 [[Node: fc1/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, 
transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"]
(fc1/Reshape, fc1/Variable/read)]]
 [[Node: Mean_1/_7 = _Recv[client_terminated=false, 
recv_device="/job:localhost/replica:0/task:0/cpu:0", 
send_device="/job:localhost/replica:0/task:0/gpu:0", 
send_device_incarnation=1, tensor_name="edge_79_Mean_1", 
tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]
()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "mnist_deep.py", line 178, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "mnist_deep.py", line 165, in main
x: batch[0], y_: batch[1], keep_prob: 1.0})
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/framework/ops.py", line 541, in eval
return _eval_using_default_session(self, feed_dict, self.graph, 
session)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/framework/ops.py", line 4085, in 
_eval_using_default_session
return session.run(tensors, feed_dict)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM 
launch failed : a.shape=(50, 3136), b.shape=(3136, 1024), m=50, 
n=1024, k=3136
 [[Node: fc1/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, 
transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"]
(fc1/Reshape, fc1/Variable/read)]]
 [[Node: Mean_1/_7 = _Recv[client_terminated=false, 
recv_device="/job:localhost/replica:0/task:0/cpu:0", 
send_device="/job:localhost/replica:0/task:0/gpu:0", 
send_device_incarnation=1, tensor_name="edge_79_Mean_1", 
tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]
()]]

Caused by op 'fc1/MatMul', defined at:
File "mnist_deep.py", line 178, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "mnist_deep.py", line 134, in main
y_conv, keep_prob = deepnn(x)
File "mnist_deep.py", line 83, in deepnn
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/ops/math_ops.py", line 1844, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/ops/gen_math_ops.py", line 1289, in 
_mat_mul
transpose_b=transpose_b, name=name)
 File "/home/nmh/env/lib/python3.6/site  
/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/nmh/env/lib/python3.6/site-
packages/tensorflow/python/framework/ops.py", line 1204, in __init__
self._traceback = self._graph._extract_stack()  # pylint: 
disable=protected-access

InternalError (see above for traceback): Blas GEMM launch failed : 
a.shape=(50, 3136), b.shape=(3136, 1024), m=50, n=1024, k=3136
 [[Node: fc1/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, 
transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"]
(fc1/Reshape, fc1/Variable/read)]]
 [[Node: Mean_1/_7 = _Recv[client_terminated=false, 
recv_device="/job:localhost/replica:0/task:0/cpu:0", 
send_device="/job:localhost/replica:0/task:0/gpu:0", 
send_device_incarnation=1, tensor_name="edge_79_Mean_1", 
tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]
()]]

1 Answer:

Answer 0 (score: 0)

I had the same error once. It was caused by an out-of-memory error (my training ate up the RAM and the OS killed it), and because that was quite violent I also lost contact with the GPU. A few reboots, plus removing and reseating the GPU, got it working again.
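A quick way to confirm that the GPU has come back after a reset is to list the devices TensorFlow can see. This is a minimal sanity-check sketch, assuming the TensorFlow 1.x device_lib module (it is not something mnist_deep.py itself does):

# Sanity check (TensorFlow 1.x): if the GPU has recovered, a GPU entry
# should appear in this list alongside the CPU device.
from tensorflow.python.client import device_lib

for d in device_lib.list_local_devices():
    # Each entry reports its device type ('CPU' or 'GPU'), name, and memory limit.
    print(d.device_type, d.name, d.memory_limit)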

You can look at this question to see whether your problem is the same. If so, you may have to use a smaller network.
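If it does turn out to be memory pressure, an alternative to shrinking the network is to keep TensorFlow from reserving almost all of the card's 8 GB up front. Below is a minimal sketch using the standard TensorFlow 1.x session options; it is an illustration to adapt to wherever mnist_deep.py creates its session, not a confirmed fix for this particular crash:

import tensorflow as tf

# Let the allocator grow GPU memory on demand instead of grabbing
# nearly all of it at startup.
gpu_options = tf.GPUOptions(allow_growth=True)

# Alternatively, hard-cap the fraction of GPU memory the process may use:
# gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.7)

config = tf.ConfigProto(gpu_options=gpu_options)
sess = tf.Session(config=config)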