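What is the right pattern for reading a large text data set in batches with TensorFlow?
Each record is one line of text in the format below. A single txt file contains billions of such lines.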
target context label
I am currently trying TFRecords, as recommended in the official documentation.
This is how I do it:
import tensorflow as tf  # TensorFlow 1.x queue-based input API

filename_queue = tf.train.string_input_producer(
    [self._train_data], num_epochs=self._num_epochs)
reader = tf.TFRecordReader()
# Read and parse one serialized example per call.
_, serialized_example = reader.read(filename_queue)
features = tf.parse_single_example(
    serialized_example,
    # Defaults are not specified since all three keys are required.
    features={
        'target': tf.FixedLenFeature([], tf.int64),
        'context': tf.FixedLenFeature([], tf.int64),
        'label': tf.FixedLenFeature([], tf.int64),
    })
target = features['target']
context = features['context']
label = features['label']
min_after_dequeue = 10000
capacity = min_after_dequeue + 3 * self._batch_size
target_batch, context_batch, label_batch = tf.train.shuffle_batch(
    [target, context, label], batch_size=self._batch_size, capacity=capacity,
    min_after_dequeue=min_after_dequeue, num_threads=self._concurrent_steps)
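I then profiled the run with timeline. The results show that this input part takes up most of the time. Here is the profiling chart (image: the profiling result).
By the way, I use a batch size of 500. Any suggestions?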
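For reference, the queue sizing in the snippet above can be worked out for a batch size of 500. This is only a sketch of the arithmetic; `batch_size` here is a stand-in for `self._batch_size`.

```python
# Values taken from the snippet above; batch_size stands in for self._batch_size.
min_after_dequeue = 10000
batch_size = 500

# Same formula as in the question's code.
capacity = min_after_dequeue + 3 * batch_size
print(capacity)  # 11500
```

Since `reader.read` returns one record per call, filling a single batch of 500 takes 500 separate read-and-parse steps in this pipeline.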
Answer 0 (score: 0)
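Applying tf.parse_example() to a batch of elements is usually more efficient than applying tf.parse_single_example() to each element, because the former op has an efficient multithreaded implementation that it can use when the input contains multiple examples. The following rewrite of your code should improve performance: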
filename_queue = tf.train.string_input_producer(
    [self._train_data], num_epochs=self._num_epochs)
reader = tf.TFRecordReader()
# Read a batch of up to 128 examples at once.
_, serialized_examples = reader.read_up_to(filename_queue, 128)
features = tf.parse_example(
    serialized_examples,
    # Defaults are not specified since all three keys are required.
    features={
        'target': tf.FixedLenFeature([], tf.int64),
        'context': tf.FixedLenFeature([], tf.int64),
        'label': tf.FixedLenFeature([], tf.int64),
    })
target = features['target']
context = features['context']
label = features['label']
min_after_dequeue = 10000
capacity = min_after_dequeue + 3 * self._batch_size
# Pass `enqueue_many=True` because the input is now a batch of parsed examples.
target_batch, context_batch, label_batch = tf.train.shuffle_batch(
    [target, context, label], batch_size=self._batch_size, capacity=capacity,
    min_after_dequeue=min_after_dequeue, num_threads=self._concurrent_steps,
    enqueue_many=True)
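To make the read-many, parse-once, enqueue-many idea concrete without depending on TensorFlow, here is a rough pure-Python analogy. The function names (`read_up_to`, `parse_batch`, `shuffle_batches`) are hypothetical, the record format (three whitespace-separated integers per line) is an assumption, and the shuffling is much simpler than what TensorFlow does internally. Treat it as a sketch of the data flow, not as the library's implementation.

```python
import itertools
import random


def read_up_to(lines, n):
    """Yield lists of at most n records, loosely like reader.read_up_to."""
    it = iter(lines)
    while True:
        chunk = list(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk


def parse_batch(chunk):
    """Parse a whole chunk in one call and return three parallel lists."""
    rows = [tuple(int(v) for v in line.split()) for line in chunk]
    targets, contexts, labels = zip(*rows)
    return list(targets), list(contexts), list(labels)


def shuffle_batches(chunks, batch_size, min_after_dequeue, seed=0):
    """Buffer parsed rows and emit shuffled batches of batch_size rows."""
    rng = random.Random(seed)
    buffer = []
    for chunk in chunks:
        targets, contexts, labels = parse_batch(chunk)
        # Analogous to enqueue_many=True: each chunk adds many rows at once.
        buffer.extend(zip(targets, contexts, labels))
        # Only dequeue while enough rows remain to keep the buffer mixed.
        while len(buffer) >= min_after_dequeue + batch_size:
            rng.shuffle(buffer)
            batch = buffer[:batch_size]
            del buffer[:batch_size]
            yield batch
    # Input exhausted: drain whatever is left.
    rng.shuffle(buffer)
    for i in range(0, len(buffer), batch_size):
        yield buffer[i:i + batch_size]


# Synthetic stand-in data in the assumed "target context label" format.
lines = ["%d %d %d" % (i, i + 1, i % 2) for i in range(1000)]
batches = list(shuffle_batches(read_up_to(lines, 128), 50, 200))
```

The point of the analogy is that per-call overhead is paid once per chunk of up to 128 records instead of once per record, which is the same reason the batched `tf.parse_example` version tends to be faster.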