I have been tracking down a SEGFAULT in TensorFlow. The problem can be reproduced with the following snippet:
import tensorflow as tf

with tf.device('/cpu:0'):
    xin = tf.placeholder(tf.float32, [None, 1, 1], name='input')
    rnn_cell = tf.contrib.rnn.LSTMCell(1)
    out, _ = tf.nn.dynamic_rnn(rnn_cell, xin, dtype=tf.float32)
    out = tf.layers.batch_normalization(out, training=True)
    out = tf.identity(out, name='output')

    optimiser = tf.train.AdamOptimizer(.0001)
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        out = optimiser.minimize(out, global_step=tf.Variable(0, dtype=tf.float32), name='train_op')

config = tf.ConfigProto(allow_soft_placement=False)
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())

sample_in = [[[0]]]
sess.run(out, feed_dict={xin: sample_in})
I managed to track the problem down, and I have a pull-request for it on github. If you run this code with my patch applied, you get the following error message instead:
2018-04-03 13:09:24.326950: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:65:00.0
totalMemory: 11.90GiB freeMemory: 11.74GiB
2018-04-03 13:09:24.326982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-04-03 13:09:24.512956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11366 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:65:00.0, compute capability: 6.1)
Traceback (most recent call last):
File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1361, in _do_call
return fn(*args)
File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
target_list, status, run_metadata)
File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Cycle detected when adding enter->frame edge: Edge from gradients/f_count to (null) would create a cycle.
+-> (null)
| rnn/TensorArrayStack/TensorArrayGatherV3
| rnn/transpose_1
| batch_normalization/moments/mean
| batch_normalization/moments/Squeeze
| batch_normalization/AssignMovingAvg/sub
| batch_normalization/AssignMovingAvg/mul
| batch_normalization/AssignMovingAvg
+-- gradients/f_count
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "breakage.py", line 21, in <module>
sess.run(out, feed_dict={xin: sample_in})
File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Cycle detected when adding enter->frame edge: Edge from gradients/f_count to (null) would create a cycle.
+-> (null)
| rnn/TensorArrayStack/TensorArrayGatherV3
| rnn/transpose_1
| batch_normalization/moments/mean
| batch_normalization/moments/Squeeze
| batch_normalization/AssignMovingAvg/sub
| batch_normalization/AssignMovingAvg/mul
| batch_normalization/AssignMovingAvg
+-- gradients/f_count
This seems to indicate a topological problem with my example code. The problem appears to occur whenever I combine any kind of RNN with batch normalization and the required additional control dependency on the update ops. I managed to mitigate the issue by switching to tf.contrib.layers.batch_norm and setting its updates_collections argument to None, which inlines the update operations into the layer itself.
For reference, here is the updated code sample:
import tensorflow as tf

with tf.device('/cpu:0'):
    xin = tf.placeholder(tf.float32, [None, 1, 1], name='input')
    rnn_cell = tf.contrib.rnn.LSTMCell(1)
    out, _ = tf.nn.dynamic_rnn(rnn_cell, xin, dtype=tf.float32)
    out = tf.contrib.layers.batch_norm(out, is_training=True, updates_collections=None)
    out = tf.identity(out, name='output')

    optimiser = tf.train.AdamOptimizer(.0001)
    out = optimiser.minimize(out, global_step=tf.Variable(0, dtype=tf.float32), name='train_op')

config = tf.ConfigProto(allow_soft_placement=False)
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())

sample_in = [[[0]]]
sess.run(out, feed_dict={xin: sample_in})
According to the documentation, this may come with a performance penalty, and it is still not clear to me what I did wrong in the first place. Does my code look correct?
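For completeness, another workaround that should avoid the control-dependency wiring altogether (a minimal sketch only; I have not verified it against the XLA build) is to keep tf.layers.batch_normalization but run the update ops as explicit fetches alongside the train op:

# Sketch of an alternative workaround (untested against the XLA build):
# fetch the batch-norm moving-average updates explicitly instead of
# making the train op depend on them via control dependencies.
import tensorflow as tf

with tf.device('/cpu:0'):
    xin = tf.placeholder(tf.float32, [None, 1, 1], name='input')
    rnn_cell = tf.contrib.rnn.LSTMCell(1)
    out, _ = tf.nn.dynamic_rnn(rnn_cell, xin, dtype=tf.float32)
    out = tf.layers.batch_normalization(out, training=True)
    out = tf.identity(out, name='output')

    optimiser = tf.train.AdamOptimizer(.0001)
    train_op = optimiser.minimize(out, name='train_op')
    # The moving-average updates remain in the UPDATE_OPS collection.
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=False))
sess.run(tf.global_variables_initializer())

sample_in = [[[0]]]
# Run the train op and the update ops in the same call.
sess.run([train_op] + update_ops, feed_dict={xin: sample_in})

Since the updates are driven by the sess.run call itself, no edge from the update ops into the training subgraph is ever created.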
Note also that this problem only occurs when TensorFlow is built with XLA JIT support, which makes me think it might be a bug in TensorFlow.
Edit: I have also filed an issue on Github.