While training a fairly standard convolutional network I ran into a strange error. Everything starts off with a healthy-looking loss curve, but then the loss suddenly drops to zero. I was able to trace the NaNs all the way back to the input pipeline.
As you can see, I am printing the values immediately before and immediately after batching with tf.train.shuffle_batch(). The second print comes out as nan, and the problem propagates from there through the rest of the graph.
What could be causing this? I have played with different values for capacity, the number of threads, and so on.
Code and context are below. The NaNs show up in the batched before/after images, but not in the individual before/after images prior to batching.
I should note that the tfrecord files each contain an arbitrary number of examples, but I don't think that should matter for the enqueue/dequeue operations.
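(As a side note for anyone reproducing this: the same check can be made to fail loudly instead of just printing, by wrapping the tensors in tf.check_numerics. The two lines below are only a sketch, not part of my actual pipeline; before_image and before_images refer to the tensors defined in the code that follows.)

# Sketch only: tf.check_numerics returns its input unchanged but raises an
# InvalidArgumentError the moment the tensor contains a NaN or Inf, which
# pinpoints whether the bad values exist before or after shuffle_batch.
before_image = tf.check_numerics(before_image, "NaN/Inf before shuffle_batch")
before_images = tf.check_numerics(before_images, "NaN/Inf after shuffle_batch")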
# (assumes module-level imports: import random, import tensorflow as tf)
def input_pipeline(self, filenames, batch_size, num_epochs=None):
    """Function that creates a highly abstracted input pipeline consisting
    of a bunch of threads and queues given a few simple parameters.
    See https://www.tensorflow.org/versions/r0.10/how_tos/reading_data/index.html#multiple-input-pipelines
    for more information and in-depth explanations.

    Args:
      - filenames: a list of filenames of tfrecords files
      - batch_size: number of examples per dequeued batch
      - num_epochs: number of epochs to produce, or None to cycle indefinitely
    """
    random.shuffle(filenames)
    train_filenames = filenames
    train_filename_queue = (
        tf.train.string_input_producer(train_filenames,
                                       num_epochs=num_epochs,
                                       shuffle=True,
                                       seed=1))
    before_image, after_image, mask_image = (
        self._read_and_decode_reach_tfrecords(train_filename_queue))
    # min_after_dequeue defines how big a buffer we will randomly sample
    # from -- bigger means better shuffling but slower start up and more
    # memory used.
    # capacity must be larger than min_after_dequeue and the amount larger
    # determines the maximum we will prefetch. Recommendation:
    # min_after_dequeue + (num_threads + a safety margin) * batch_size
    min_after_dequeue = 1000
    safety_margin = 3
    capacity = min_after_dequeue + (3 + safety_margin) * batch_size
    capacity = 2000
    # Print the mean of the decoded tensors before batching...
    before_image = tf.Print(before_image,
                            [tf.reduce_mean(before_image + after_image)],
                            "pre_shuffle: ")
    mask_image = tf.Print(mask_image,
                          [tf.reduce_mean(mask_image)],
                          "pre_shuffle_mask: ")
    before_images, after_images, mask_images = (
        tf.train.shuffle_batch(
            [before_image, after_image, mask_image], batch_size=batch_size,
            capacity=capacity, min_after_dequeue=min_after_dequeue,
            num_threads=5, seed=1))
    # ...and again after batching; this second print is the one that shows nan.
    before_images = tf.Print(before_images,
                             [tf.reduce_mean(before_images + after_images)],
                             "post_shuffle: ")
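For completeness, this is roughly how the pipeline is consumed on the other side. It is a simplified sketch rather than my actual training loop, and it assumes input_pipeline goes on to return the three batched tensors; the coordinator/queue-runner boilerplate is the standard TF 1.x pattern for queue-based pipelines.

# Sketch of the consuming side (simplified; assumes input_pipeline returns
# (before_images, after_images, mask_images)).
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # string_input_producer's num_epochs counter is a local variable.
    sess.run(tf.local_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            sess.run([before_images, after_images, mask_images])
    except tf.errors.OutOfRangeError:
        pass  # reached num_epochs
    finally:
        coord.request_stop()
    coord.join(threads)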