MNIST For ML Beginners
The tutorial gives an error when running
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))
Everything else works fine.
Error and traceback:
InternalErrorTraceback (most recent call last)
<ipython-input-16-219711f7d235> in <module>()
----> 1 print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))
/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in run(self, fetches, feed_dict, options, run_metadata)
338 try:
339 result = self._run(None, fetches, feed_dict, options_ptr,
--> 340 run_metadata_ptr)
341 if run_metadata:
342 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _run(self, handle, fetches, feed_dict, options, run_metadata)
562 try:
563 results = self._do_run(handle, target_list, unique_fetches,
--> 564 feed_dict_string, options, run_metadata)
565 finally:
566 # The movers are no longer used. Delete them.
/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
635 if handle is None:
636 return self._do_call(_run_fn, self._session, feed_dict, fetch_list,
--> 637 target_list, options, run_metadata)
638 else:
639 return self._do_call(_prun_fn, self._session, handle, feed_dict,
/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_call(self, fn, *args)
657 # pylint: disable=protected-access
658 raise errors._make_specific_exception(node_def, op, error_message,
--> 659 e.code)
660 # pylint: enable=protected-access
661
InternalError: Dst tensor is not initialized.
[[Node: _recv_Placeholder_3_0/_1007 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_312__recv_Placeholder_3_0", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
[[Node: Mean_1/_1011 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_319_Mean_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
I just switched to a newer version of CUDA, so could this be related to that? The error seems to be about copying a tensor to the GPU.
Stack: EC2 g2.8xlarge machine, Ubuntu 14.04
Update:
print(sess.run(accuracy, feed_dict={x: batch_xs, y_: batch_ys}))
runs fine. This makes me suspect that the problem is that I'm trying to pass a huge tensor to the GPU and it can't take it. Small tensors like a minibatch work just fine.
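A common workaround for this situation is to evaluate the test set in slices and combine the per-slice results. Here is a sketch; eval_in_batches and run_accuracy are hypothetical helpers, with run_accuracy wrapping the sess.run call above:

```python
def eval_in_batches(run_accuracy, num_examples, batch_size=1000):
    """Average per-slice accuracies, weighted by slice size.

    run_accuracy(start, end) should return the accuracy on examples
    [start, end), e.g. a wrapper around
    sess.run(accuracy, feed_dict={x: mnist.test.images[start:end],
                                  y_: mnist.test.labels[start:end]}).
    """
    total_correct = 0.0
    for start in range(0, num_examples, batch_size):
        end = min(start + batch_size, num_examples)
        # Weight each slice's accuracy by how many examples it covers,
        # so a short final slice doesn't skew the average.
        total_correct += run_accuracy(start, end) * (end - start)
    return total_correct / num_examples
```

Since each feed is only batch_size examples, the GPU never has to hold the full test set's activations at once.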
Update 2:
I've pinned down exactly how big a tensor causes the problem:
batch_size = 7509 #Works.
print(sess.run(accuracy, feed_dict={x: mnist.test.images[0:batch_size], y_: mnist.test.labels[0:batch_size]}))
batch_size = 7510 #Doesn't work. Gets the Dst error.
print(sess.run(accuracy, feed_dict={x: mnist.test.images[0:batch_size], y_: mnist.test.labels[0:batch_size]}))
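For scale, the raw feed at this boundary is not actually large. A back-of-the-envelope estimate (assuming float32 values, 784 pixels per image, and 10-way one-hot labels) suggests the GPU was already nearly full rather than the tensor being inherently too big:

```python
def feed_bytes(batch_size, pixels=784, classes=10, bytes_per_float=4):
    # x is batch_size x 784 and y_ is batch_size x 10, both float32.
    return batch_size * (pixels + classes) * bytes_per_float

print(feed_bytes(7510))  # 23851760 bytes, i.e. roughly 23 MB of feed data
```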
Answer 0 (score: 17)
In short, this error message is generated when there is not enough memory to handle the batch size.
Expanding on Steven's link (I can't comment yet), here are a few tricks to monitor/control memory usage in TensorFlow:
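One such knob is the session configuration. This is a sketch using the TensorFlow 1.x Session API; the memory fraction is an illustrative value, not a recommendation:

```python
import tensorflow as tf

# Let TensorFlow grow its GPU allocation on demand instead of
# grabbing nearly all GPU memory up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, cap the process at a fixed fraction of GPU memory:
# config.gpu_options.per_process_gpu_memory_fraction = 0.4
sess = tf.Session(config=config)
```

With allow_growth enabled, a second process (like the training job in Answer 2 below) is less likely to have claimed the whole card.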
Answer 1 (score: 3)
Keep in mind that the ec2 g2.8xlarge only has 4 GB of GPU memory https://aws.amazon.com/ec2/instance-types/
I don't have a good way to figure out how much memory the model itself takes other than running with a batch size of 1; from that you can subtract off how much memory a single image takes.
From there, you can determine the maximum batch size. That should work, but I believe TensorFlow allocates GPU memory dynamically, similar to Torch, unlike Caffe, which blocks out the maximum GPU space it needs from the get-go. So you may want to be conservative with the maximum batch size.
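The probing idea above can be automated. Here is a sketch of a binary search over a hypothetical fits(batch_size) predicate, e.g. a wrapper that tries sess.run on that many examples and returns False if it raises a memory error:

```python
def max_batch_size(fits, lo=1, hi=8192):
    """Binary-search the largest batch size b in [lo, hi] with fits(b) True.

    Assumes fits is monotone: once a size fails, all larger sizes fail.
    Returns 0 if even the smallest size fails.
    """
    best = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if fits(mid):
            best = mid       # mid works; try something larger
            lo = mid + 1
        else:
            hi = mid - 1     # mid fails; try something smaller
    return best
```

Given the numbers in the question, a probe like this would home in on 7509 in about a dozen trials instead of thousands.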
Answer 2 (score: 0)
I think this link can help https://github.com/aymericdamien/TensorFlow-Examples/issues/38#issuecomment-223793214.
In my case, the GPU was busy (93% utilization) training another model in a screen session. I had to kill that process, and afterwards was happy to see everything working.