I recently switched my modeling framework over to custom TensorFlow Estimators and Datasets, and I'm quite happy with this workflow.
However, I just noticed a problem with how my dataset_input_fn loads data from tfrecords. My input function is modeled after the example in the TensorFlow documentation. The issue arises when I have more examples than I can fit in RAM. If I have 1e6 examples and set my shuffle buffer_size to 1e5, then a subset of 1e5 examples is selected once, shuffled, and iterated over. This means my model is only trained on 10% of my full dataset. The code that sets up this behavior is taken directly from the TensorFlow documentation example code:
dataset = dataset.map(parser)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(32)
dataset = dataset.repeat(num_epochs)
iterator = dataset.make_one_shot_iterator()
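For context, that snippet sits inside my dataset_input_fn roughly like the sketch below; the parser feature spec and filenames here are placeholders rather than my actual code:

import tensorflow as tf

def dataset_input_fn(filenames, num_epochs, batch_size=32):
    # Placeholder parser: the feature spec must match whatever was written
    # into the tfrecords.
    def parser(record):
        features = {
            "feature": tf.FixedLenFeature([], tf.string),
            "label": tf.FixedLenFeature([], tf.int64),
        }
        parsed = tf.parse_single_example(record, features)
        return parsed["feature"], parsed["label"]

    dataset = tf.data.TFRecordDataset(filenames)
    dataset = dataset.map(parser)
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.batch(batch_size)
    dataset = dataset.repeat(num_epochs)
    iterator = dataset.make_one_shot_iterator()
    features, labels = iterator.get_next()
    return features, labels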
My question: as I train, is it possible to populate the shuffle buffer with new examples beyond the initial 1e5? Does the one_shot_iterator support this kind of functionality? Do I need to use an initializable iterator?
Thanks!
Answer 0 (score: 5)
I have found what appears to be a tenable workaround for now. Through some experimentation, I learned that when instantiating a TFRecordDataset,
filenames = ["file1.tfrecord", ..., "filen.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
and setting up a shuffle buffer:
dataset = dataset.shuffle(buffer_size=10000)
the buffer is only populated with the first 10000 examples, drawn from however many tfrecord files that requires. For example, in my case I have ~300 tfrecord files containing 4096 examples each. On examination, my shuffle buffer appears to consist only of examples from the first 3 tfrecord files in my filenames list. Since my filenames list is static, this means that my model is only trained on my first 3 tfrecords!
My fix for now is pretty simple. In my training loop I already alternate between Estimator.train and Estimator.evaluate, and I noticed that each time I call Estimator.train, the shuffle buffer is repopulated. My solution, then, is to shuffle my filenames each time my input_fn is called. This is not a very elegant solution, but it does achieve the desired effect of allowing me to iterate across all tfrecords.
# My Crappy Fix: shuffle file names in input_fn
np.random.shuffle(filenames)  # requires: import numpy as np
dataset = tf.data.TFRecordDataset(filenames)
What's annoying about this solution (aside from its kludginess) is that my minibatches are not "globally random". Rather, they are selected from a small subset of tfrecords, and only that subset is used for each training/evaluation cycle. One way to mitigate this is to increase my shuffle buffer size or decrease my tfrecord size; I'll probably do both. It's also worth noting that if
shuffle_buffer_size < (tf_record_size + minibatch_size)
then, as far as I can tell, my TFRecordDataset will pull from a single tfrecord file!
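For concreteness, here is a rough sketch of how I'd wire the workaround (shuffled filenames plus a larger buffer) into the input_fn; the glob pattern, parser, and buffer size below are placeholders rather than my exact code:

import glob
import numpy as np
import tensorflow as tf

def dataset_input_fn(tfrecord_dir, num_epochs, batch_size=32):
    # Re-shuffle the file list on every call, so each Estimator.train /
    # Estimator.evaluate cycle fills its shuffle buffer from a different
    # set of tfrecord files.
    filenames = glob.glob(tfrecord_dir + "/*.tfrecord")
    np.random.shuffle(filenames)

    dataset = tf.data.TFRecordDataset(filenames)
    dataset = dataset.map(parser)  # same placeholder parser as above
    # A larger buffer mixes examples from more than the first few tfrecords;
    # keep it comfortably above tf_record_size + minibatch_size.
    dataset = dataset.shuffle(buffer_size=50000)
    dataset = dataset.batch(batch_size)
    dataset = dataset.repeat(num_epochs)
    iterator = dataset.make_one_shot_iterator()
    return iterator.get_next()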
Finally, I don't think the relevant tensorflow documentation conveys these complexities well. The documentation alludes to the ability to train on large datasets that don't fit into memory, but doesn't provide much detail. It seems unlikely that the tf authors had in mind my hacky strategy when writing this, so I remain curious to see if there's a better approach.