Question

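I am training a deep learning model with Python Keras's fit_generator() on a 30-million-line file (each line is one sample), feeding it batches from a generator. Because the samples vary a lot in size, I use bucketing for efficiency, to avoid sparse batches.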

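To do so, I sorted the file by increasing line size and wrote a generator that iterates over the lines and yields a batch every time batch_size lines have been processed: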

def generate_batches_from_file():
    while True:
        doc_list = []
        with open(path_to_file) as docs:
            # count from 1 so a batch is emitted after every batch_size lines
            for my_counter, doc in enumerate(docs, start=1):
                doc_list.append(doc)
                if my_counter % batch_size == 0:
                    doc_array = truncation_padding_other_stuff(doc_list)
                    yield doc_array
                    doc_list = []  # start collecting the next batch

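This way, the samples in each batch have equal or very similar sizes, and the tensors are dense.

However, best practice in deep learning is that batches should not be fed to the model in the same order at every epoch (for regularization purposes).

Given that I generate batches on the fly and have to process the large sorted input file line by line in order to bucket, how can I shuffle the batches in my setup?

Note that I do not want to shuffle the samples within each batch; I want the batches to be passed in a different order at each epoch.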

Answer 1

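For simplicity, suppose you have 1000 training samples (denoted n_samples) and the batch size is set to 10. The lines are sorted, so the batches are: (docs[0],...,docs[9]), (docs[10],...,docs[19]),...,(docs[990],...,docs[999]). Therefore, to generate the batches in a different order, you can simply store the starting index of each batch and shuffle those indices at the beginning of each epoch: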

import numpy as np

def generate_batches_from_file():
    # starting line index of every batch
    indices = np.arange(0, n_samples, batch_size)
    # a file object cannot be sliced, so read the lines into a list first
    # (this assumes the whole file fits in memory)
    with open(path_to_file) as f:
        docs = f.readlines()
    while True:
        # shuffle the indices each time to generate batches in different order
        np.random.shuffle(indices)
        for idx in indices:
            doc_array = truncation_padding_other_stuff(docs[idx:idx+batch_size])
            yield doc_array
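
With 30 million lines, keeping every line in memory may not be practical, and a file object cannot be indexed by line number. One possible variant of the same idea, offered only as a rough sketch, is to scan the file once, remember the byte offset at which each batch starts, shuffle those offsets at every epoch, and seek to them. The names index_batch_offsets, generate_shuffled_batches and the process argument are hypothetical; process stands in for truncation_padding_other_stuff from the question. The sketch assumes the file is UTF-8 encoded and does not change between epochs.

```python
import random
from itertools import islice


def index_batch_offsets(path, batch_size):
    """Scan the file once and return the byte offset where each batch starts."""
    offsets = []
    position = 0
    with open(path, "rb") as docs:
        for line_number, line in enumerate(docs):
            if line_number % batch_size == 0:
                offsets.append(position)
            position += len(line)  # binary mode, so len() counts bytes
    return offsets


def generate_shuffled_batches(path, batch_size, process=lambda lines: lines):
    """Yield batches of consecutive lines forever, in a new batch order each epoch."""
    offsets = index_batch_offsets(path, batch_size)
    while True:
        # reorder the batches only; lines inside a batch keep their sorted order
        random.shuffle(offsets)
        with open(path, "rb") as docs:
            for offset in offsets:
                docs.seek(offset)
                lines = [raw.decode("utf-8") for raw in islice(docs, batch_size)]
                yield process(lines)
```

Only one integer per batch is kept in memory, so each batch still contains neighbouring lines of similar size. If the number of lines is not a multiple of batch_size, the last batch in the file is smaller than the others.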

