tensorflow - Distributed TensorFlow train.Supervisor RuntimeError on stop

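I am testing distributed TensorFlow with code that is almost identical to inception_distributed_train.py (synchronous data parallelism), except that it uses the default MNIST dataset from the basic examples.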

On the chief worker only, with sv = tf.train.Supervisor, the final call to sv.stop() raises RuntimeError: ('Coordinator stopped with threads still running: %s', 'Thread-4').
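For context, an error with this wording is raised when a coordinator asks its threads to stop and then finds that at least one is still alive after the grace period. The sketch below reproduces that situation with the Python standard library only. MiniCoordinator, polite_worker, blocked_worker, and run_demo are hypothetical names made up for illustration; this is not TensorFlow's implementation, and it does not claim to show what Thread-4 is doing in the setup described here.

```python
import threading
import time


class MiniCoordinator:
    """Hypothetical stand-in for a thread coordinator (not a TensorFlow class)."""

    def __init__(self):
        self._stop_event = threading.Event()
        self._threads = []

    def register(self, thread):
        self._threads.append(thread)

    def should_stop(self):
        return self._stop_event.is_set()

    def request_stop(self):
        self._stop_event.set()

    def join(self, grace_period_secs=0.2):
        # Wait for every thread, but never longer than the grace period in total.
        deadline = time.monotonic() + grace_period_secs
        for t in self._threads:
            t.join(timeout=max(0.0, deadline - time.monotonic()))
        stragglers = [t.name for t in self._threads if t.is_alive()]
        if stragglers:
            raise RuntimeError(
                "Coordinator stopped with threads still running: %s"
                % " ".join(stragglers))


def polite_worker(coord):
    # Checks the stop flag regularly, so it exits soon after request_stop().
    while not coord.should_stop():
        time.sleep(0.01)


def blocked_worker(release):
    # Assumption for illustration: a thread parked in a blocking call
    # that never looks at the coordinator's stop flag.
    release.wait()


def run_demo():
    coord = MiniCoordinator()
    release = threading.Event()
    threads = [
        threading.Thread(target=polite_worker, args=(coord,),
                         name="polite", daemon=True),
        threading.Thread(target=blocked_worker, args=(release,),
                         name="blocked", daemon=True),
    ]
    for t in threads:
        coord.register(t)
        t.start()
    coord.request_stop()
    message = None
    try:
        coord.join()
    except RuntimeError as err:
        message = str(err)
    finally:
        release.set()  # let the blocked thread finish so the demo exits cleanly
    return message


if __name__ == "__main__":
    print(run_demo())
```

In this sketch only the thread that never checks the stop flag is reported, which suggests looking for a thread on the chief that is still waiting on a blocking operation when sv.stop() is called.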

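At the same time, on the ps node I see the log message Variable:0: Skipping cancelled dequeue attempt with queue not closed, and the same message for Variables 1 through 7. Interestingly, it does not appear for Variable 8, which is defined as global_step = tf.Variable(0) and passed to the minimize method of tf.train.SyncReplicasOptimizer.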

Does this error ring a bell for anyone? I really cannot see how my logic differs from that of inception_distributed_train.py.

0 answers: