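Over the past few days I modified the official cifar10_main.py to train on the Kaggle dogs_cats_redux dataset.
First, I created the TFRecord files following the standard pipeline; you can download the TFRecord files here.
Then I wrote some TFRecord parsing functions and a dogs_cats_model class. The rest of the code is the same as the original resnet repository; you can check my main.py here.
But when I run main.py, it raises a CancelledError: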
Traceback (most recent call last):
File "main.py", line 326, in <module>
main(argv=sys.argv)
File "main.py", line 321, in main
shape=[_HEIGHT, _WIDTH, _NUM_CHANNELS])
File "/home/jto/projects/dogs_cats_tf/official/resnet/resnet_run_loop.py", line 396, in resnet_main
max_steps=flags.max_train_steps)
File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 363, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 843, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 859, in _train_model_default
saving_listeners)
File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1059, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 567, in run
run_metadata=run_metadata)
File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1043, in run
run_metadata=run_metadata)
File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1134, in run
raise six.reraise(*original_exc_info)
File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/six.py", line 686, in reraise
raise value
File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1119, in run
return self._sess.run(*args, **kwargs)
File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1191, in run
run_metadata=run_metadata)
File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 971, in run
return self._sess.run(*args, **kwargs)
File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/home/jto/anaconda3/envs/tf/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.CancelledError: Queue '_2_input_producer' is already closed.
[[Node: input_producer/input_producer_Close = QueueCloseV2[cancel_pending_enqueues=false](input_producer)]]
[[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[?,224,224,3], [?,2]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator)]]
[[Node: IteratorGetNext/_2401 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_534_IteratorGetNext", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
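To make the relevant part of the message easier to read, here is a small sketch that pulls the queue name and the node/op pairs out of the error text. `summarize_error` is a hypothetical helper written for this post, not part of TensorFlow, and only the first two node lines are pasted in to keep it short.

```python
import re

# Tail of the error message from the traceback above (first two node lines only).
ERROR_TEXT = """\
tensorflow.python.framework.errors_impl.CancelledError: Queue '_2_input_producer' is already closed.
[[Node: input_producer/input_producer_Close = QueueCloseV2[cancel_pending_enqueues=false](input_producer)]]
[[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[?,224,224,3], [?,2]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator)]]
"""


def summarize_error(text):
    """Return (queue_name, [(node_name, op_type), ...]) parsed from the message.

    Hypothetical helper: it only uses regular expressions on the pasted text.
    """
    queue = re.search(r"Queue '([^']+)'", text)
    # Each node line looks like "[[Node: <name> = <OpType>[...]".
    nodes = re.findall(r"\[\[Node: (\S+) = (\w+)", text)
    return (queue.group(1) if queue else None), nodes


queue_name, nodes = summarize_error(ERROR_TEXT)
print("queue:", queue_name)
for name, op in nodes:
    print(op, "<-", name)
```

The output lists a queue op (`QueueCloseV2` on `input_producer`) next to a dataset iterator op (`IteratorGetNext`), so the graph appears to contain both a queue and a `tf.data` iterator.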
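I searched for solutions, and most of them say this happens because the data queue was stopped somehow or was never started correctly. They suggest something like this: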
# Start the data queue
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess, coord)
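But the official resnet_run_loop.resnet_main() uses tf.estimator.Estimator to train the model, and the source code does not need to start queues like this. So how can we solve this problem? Any ideas would be greatly appreciated.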
System information: Ubuntu 16.04 LTS, tensorflow-gpu v1.8.0, CUDA 9.0, cuDNN 7.1
The GitHub issue is here.