Question

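I am trying to train a model for word-based text generation. I set up the data so that every 10 words predict the next word. When I one-hot encode the labels with the Keras to_categorical function, the result is very large (about 1 million samples with a vocab_size of 10000) and consumes about 16 GB of RAM. So I decided to write a custom generator that splits the data into batches for use with fit_generator. The model seems to train well, with the loss dropping to about 1. However, when I try to generate text from a sample seed, it repeats a lot of words. I know nothing is wrong with my text generation function, because it works fine when I use the full label data (the version that consumes 16 GB of RAM), but it does not work when I feed the data in batches. So something must be wrong with my custom generator function. I should also say that this is my first attempt at writing a custom generator, so I may be misunderstanding something.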

# Function to separate data into batches
def text_generator(X, y, vocab_size, batch_size=64):
    while True:
        X_batch = np.array_split(X, batch_size)
        y_batch = np.array_split(y, batch_size)

        for i in range(len(X_batch)):
            batch_input = []
            batch_output = []
            input = X_batch[i]
            output = y_batch[i]

            output = tensorflow.keras.utils.to_categorical(output, vocab_size)
            batch_input += [input[0]]
            batch_output += [output[0]]
            batch_x = np.array(batch_input)
            batch_y = np.array(batch_output)

            yield (batch_x, batch_y)
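
One thing worth checking in the generator is what np.array_split actually returns. Its second argument is the number of sections to produce, not the number of rows per section, so passing 128 would give 128 large chunks, and input[0] / output[0] then take only a single sample from each chunk. If that reading is right, each yielded batch holds one sample and only 128 distinct samples are ever seen. The toy sketch below (the array is made up for illustration, not the real X_pad) shows the behaviour:

```python
import numpy as np

# Hypothetical stand-in for X_pad: 10 samples, 3 tokens each.
X = np.arange(30).reshape(10, 3)

# The second argument is the NUMBER of sections, not the size of each one.
chunks = np.array_split(X, 4)

print(len(chunks))               # how many chunks were produced
print([len(c) for c in chunks])  # how many rows each chunk holds

# Taking element [0] of every chunk visits one sample per chunk;
# recover the original row index of each visited sample.
first_rows = [int(c[0][0]) // 3 for c in chunks]
print(first_rows)
```
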

#To train using the custom generator
model.fit_generator(text_generator(X_pad,y,vocab_size,128),steps_per_epoch=1024,epochs=2,verbose=1,callbacks=[checkpoint])
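
For comparison, here is a minimal sketch of a generator that slices the data into consecutive batches of batch_size rows and one-hot encodes only the current batch. It is an illustration under assumptions, not a verified fix: the name batch_generator and the demo arrays are hypothetical, y is assumed to hold integer class indices, and the one-hot step is written in plain NumPy as a stand-in for to_categorical so the example runs without TensorFlow.

```python
import numpy as np

def batch_generator(X, y, vocab_size, batch_size=64):
    """Yield (inputs, one-hot labels) in slices of batch_size rows, forever."""
    n = len(X)
    while True:
        for start in range(0, n, batch_size):
            x_batch = X[start:start + batch_size]
            # Flatten labels of shape (batch, 1) to (batch,) integer indices.
            labels = np.asarray(y[start:start + batch_size]).reshape(-1).astype(int)
            # One-hot encode only this batch (NumPy stand-in for to_categorical).
            y_batch = np.zeros((len(labels), vocab_size), dtype=np.float32)
            y_batch[np.arange(len(labels)), labels] = 1.0
            yield x_batch, y_batch

# Hypothetical toy data: 10 samples, 4 tokens each, vocabulary of 50.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 50, size=(10, 4))
y_demo = rng.integers(0, 50, size=(10, 1))

gen = batch_generator(X_demo, y_demo, vocab_size=50, batch_size=4)
xb, yb = next(gen)
print(xb.shape, yb.shape)

# Number of steps needed for one pass over the whole dataset.
steps_per_epoch = int(np.ceil(len(X_demo) / 4))
print(steps_per_epoch)
```

With a generator of this shape, steps_per_epoch would normally be set to ceil(len(X_pad) / batch_size) so that one epoch covers the whole dataset. The sketch does not shuffle between epochs.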

After about 10 epochs of training, the output looks like this:

Epoch 2/2
1023/1024 [============================>.] - ETA: 0s - loss: 1.1588 - acc: 0.6422
Epoch 00002: acc improved from 0.59766 to 0.64258, saving model to weights/weights-improvement-02-0.64.hdf5
1024/1024 [==============================] - 70s 68ms/step - loss: 1.1585 - acc: 0.6426

<tensorflow.python.keras.callbacks.History at 0x7f57060ac518>

For some random seed text, the generated text looks like this:

in in in in and in in make orbit one one one a a a a a one the mainly mainly to to .

My X_pad and y look like this:

X_pad[0]=[   1 3558    3 2119 4055  155    0    0    0    0]
y[0]=[12]

Can someone help me figure out what I am doing wrong?

Something goes wrong when a Keras custom generator splits the data into batches

0 answers: