I have been tracking down a SEGFAULT in TensorFlow. The problem can be reproduced with the following snippet:
import tensorflow as tf

with tf.device('/cpu:0'):
    xin = tf.placeholder(tf.float32, [None, 1, 1], name='input')
    rnn_cell = tf.contrib.rnn.LSTMCell(1)
    out, _ = tf.nn.dynamic_rnn(rnn_cell, xin, dtype=tf.float32)
    out = tf.layers.batch_normalization(out, training=True)
    out = tf.identity(out, name='output')

    optimiser = tf.train.AdamOptimizer(.0001)
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        out = optimiser.minimize(out, global_step=tf.Variable(0, dtype=tf.float32), name='train_op')

config = tf.ConfigProto(allow_soft_placement=False)
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())

sample_in = [[[0]]]
sess.run(out, feed_dict={xin: sample_in})
I managed to track the problem down, and I have a pull-request for it on github. If you run this code with my patch applied, you get the following error message instead:
2018-04-03 13:09:24.326950: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:65:00.0
totalMemory: 11.90GiB freeMemory: 11.74GiB
2018-04-03 13:09:24.326982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-04-03 13:09:24.512956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11366 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:65:00.0, compute capability: 6.1)
Traceback (most recent call last):
File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1361, in _do_call
return fn(*args)
File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
target_list, status, run_metadata)
File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Cycle detected when adding enter->frame edge: Edge from gradients/f_count to (null) would create a cycle.
+-> (null)
| rnn/TensorArrayStack/TensorArrayGatherV3
| rnn/transpose_1
| batch_normalization/moments/mean
| batch_normalization/moments/Squeeze
| batch_normalization/AssignMovingAvg/sub
| batch_normalization/AssignMovingAvg/mul
| batch_normalization/AssignMovingAvg
+-- gradients/f_count
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "breakage.py", line 21, in <module>
sess.run(out, feed_dict={xin: sample_in})
File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/home/thom/.python/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Cycle detected when adding enter->frame edge: Edge from gradients/f_count to (null) would create a cycle.
+-> (null)
| rnn/TensorArrayStack/TensorArrayGatherV3
| rnn/transpose_1
| batch_normalization/moments/mean
| batch_normalization/moments/Squeeze
| batch_normalization/AssignMovingAvg/sub
| batch_normalization/AssignMovingAvg/mul
| batch_normalization/AssignMovingAvg
+-- gradients/f_count
This seems to indicate a topological problem with my example code. The problem appears to occur whenever I combine any kind of RNN with batch normalization and the required additional control dependency on the update ops. I managed to mitigate the issue by switching to tf.contrib.layers.batch_norm and setting its updates_collections argument to None, which inlines the update operations into the layer itself.
For reference, here is the updated code sample:
import tensorflow as tf

with tf.device('/cpu:0'):
    xin = tf.placeholder(tf.float32, [None, 1, 1], name='input')
    rnn_cell = tf.contrib.rnn.LSTMCell(1)
    out, _ = tf.nn.dynamic_rnn(rnn_cell, xin, dtype=tf.float32)
    out = tf.contrib.layers.batch_norm(out, is_training=True, updates_collections=None)
    out = tf.identity(out, name='output')

    optimiser = tf.train.AdamOptimizer(.0001)
    out = optimiser.minimize(out, global_step=tf.Variable(0, dtype=tf.float32), name='train_op')

config = tf.ConfigProto(allow_soft_placement=False)
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())

sample_in = [[[0]]]
sess.run(out, feed_dict={xin: sample_in})
According to the documentation, this may come with a performance penalty, and it is still not clear to me what I did wrong in the first place. Does my code look correct?
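For completeness, another workaround that should avoid the control-dependency wiring altogether (a minimal sketch only; I have not verified it against the XLA build) is to keep tf.layers.batch_normalization but run the update ops as explicit fetches alongside the train op:

# Sketch of an alternative workaround (untested against the XLA build):
# fetch the batch-norm moving-average updates explicitly instead of
# making the train op depend on them via control dependencies.
import tensorflow as tf

with tf.device('/cpu:0'):
    xin = tf.placeholder(tf.float32, [None, 1, 1], name='input')
    rnn_cell = tf.contrib.rnn.LSTMCell(1)
    out, _ = tf.nn.dynamic_rnn(rnn_cell, xin, dtype=tf.float32)
    out = tf.layers.batch_normalization(out, training=True)
    out = tf.identity(out, name='output')

    optimiser = tf.train.AdamOptimizer(.0001)
    train_op = optimiser.minimize(out, name='train_op')
    # The moving-average updates remain in the UPDATE_OPS collection.
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=False))
sess.run(tf.global_variables_initializer())

sample_in = [[[0]]]
# Run the train op and the update ops in the same call.
sess.run([train_op] + update_ops, feed_dict={xin: sample_in})

Since the updates are driven by the sess.run call itself, no edge from the update ops into the training subgraph is ever created.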
Note also that this problem only occurs when TensorFlow is built with XLA JIT support, which makes me think it might be a bug in TensorFlow.
Edit: I have also filed an issue on Github.