Question

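I am trying to predict the next character given a string of length 100. The problem is that while I am generating the training data, all of my RAM (32 GB on Amazon AWS - https://aws.amazon.com/marketplace/pp/B077GCH38C?qid=1527197041958&sr=0-1&ref_=srh_res_product_title) gets eaten up and the process is killed.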

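To build the training data, I iterate over a list of articles (each article has 500-1,000 characters). In each article I take the first 100 characters as input and the next character as output, then shift by one character and repeat until the end of the text. This approach produces a lot of training vectors: an article with 500 characters yields about 400 training samples, and that is the problem.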

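With 15,000 articles and a sliding window of 100 characters, there will be millions of training samples, and my AWS machine (t2.2xlarge with 32 GB RAM - https://aws.amazon.com/marketplace/pp/B077GCH38C?qid=1527197041958&sr=0-1&ref_=srh_res_product_title) dies at about 79% progress, with roughly 35 million training samples.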
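As a rough back-of-the-envelope check of why this runs out of memory, the sketch below estimates the size of the nested Python lists against a compact NumPy array. It assumes a 64-bit CPython build (8 bytes per list slot), ignores per-list overhead, and assumes the vocabulary fits in one byte per token (fewer than 256 distinct characters); the function names are made up for illustration.

```python
def list_bytes(n_samples, seq_length=100, pointer_size=8):
    """Lower bound for nested Python lists: one pointer per stored token."""
    return n_samples * seq_length * pointer_size


def array_bytes(n_samples, seq_length=100, itemsize=1):
    """Size of a dense NumPy array with `itemsize` bytes per token (1 = uint8)."""
    return n_samples * seq_length * itemsize


n = 35_000_000  # sample count quoted in the question
print(list_bytes(n) / 1e9)   # 28.0 (GB), close to the 32 GB of the machine
print(array_bytes(n) / 1e9)  # 3.5 (GB) if every token fits in one byte
```

Under these assumptions the lists alone need about 28 GB, which is consistent with the process being killed before the data is complete.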

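So my question is: is there a way in Keras to start training the model on, say, 25% of the data, then load the next 25% and continue, until everything has been consumed?

My training pseudocode: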

with open(articles_path, 'rt', encoding="UTF-8") as file:
    for line in file:
        article = line[0:-1]
        article_length = len(article)
        # here is the problematic code 
        for i in range(0, article_length - seq_length, 1):
            seq_in = article[i:i + seq_length]
            seq_out = article[i + seq_length]
            dataX.append([tokens[char] for char in seq_in])
            dataY.append(tokens[seq_out])

model = Sequential()
model.add(LSTM(256, input_shape=(seq_length, 1)))
model.add(Dropout(0.2))
model.add(Dense(len(tokens), activation=activation))
model.compile(loss=loss, optimizer=optimizer)

model.fit(X, y, epochs=epochs, batch_size=batch_size, callbacks=callbacks_list)

Note: while writing my own program, I was following this tutorial: https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/

Answer 1

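Your approach to data generation is interesting, but you do not have to generate every 100-character sample from your texts. Replace the problematic code with the following: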

    for i in range(0, article_length - seq_length, 1):
        if random.random() >= 0.2:
            continue  # skip about 80% of the samples
        seq_in = article[i:i + seq_length]
        seq_out = article[i + seq_length]
        dataX.append([tokens[char] for char in seq_in])
        dataY.append(tokens[seq_out])

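Put import random near the top of the file. Once this is in your code, on average only one in five sequences will go into the training data, which effectively reduces its size.

There is a more efficient way to generate randomly sampled strings, but it would require rewriting your code, whereas this approach only adds one line.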
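The answer does not spell out the more efficient variant. One possible reading, sketched below, is to draw the window start positions directly instead of rolling the dice at every position. The helper name sample_windows and its fraction and rng parameters are hypothetical; tokens and seq_length follow the question's code.

```python
import random


def sample_windows(article, tokens, seq_length=100, fraction=0.2, rng=random):
    """Return (X, y) for a random subset of an article's sliding windows."""
    n_windows = len(article) - seq_length
    if n_windows <= 0:
        return [], []  # article too short to produce any sample
    # Pick the start offsets up front instead of testing every position.
    k = max(1, round(n_windows * fraction))
    starts = sorted(rng.sample(range(n_windows), k))
    X = [[tokens[c] for c in article[i:i + seq_length]] for i in starts]
    y = [tokens[article[i + seq_length]] for i in starts]
    return X, y
```

Unlike the one-line filter, this yields an exact number of samples per article, and passing a seeded random.Random instance as rng makes the selection reproducible.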

Answer 2

This seems like a good time to switch to a generator: essentially, you yield one batch at a time instead of loading the whole dataset:

import numpy as np

def data_gen(batch_size=32):
  """Yield a single batch at a time."""
  dataX, dataY = [], []
  while True:  # the generator yields forever; each pass over the file is one epoch
    with open(articles_path, 'rt', encoding="UTF-8") as file:
      for line in file:
        article = line.rstrip('\n')
        for i in range(len(article) - seq_length):
          seq_in = article[i:i + seq_length]
          seq_out = article[i + seq_length]
          dataX.append([tokens[char] for char in seq_in])
          dataY.append(tokens[seq_out])
          if len(dataX) == batch_size:
            # hand over a full batch, then start collecting the next one;
            # reshape or encode the arrays here as your model expects
            yield np.array(dataX), np.array(dataY)
            dataX, dataY = [], []

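You can now train with fit_generator, which pulls batches from your generator. That way you only ever hold batch_size samples in memory rather than the whole dataset. You may also want to use NumPy arrays instead of Python lists.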
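Because the generator never ends, fit_generator needs a steps_per_epoch value to know when an epoch is over. The sketch below shows one way to compute it, assuming every yielded batch holds exactly batch_size samples; the helper names are hypothetical, and the training call is left as a comment because it depends on the model and generator defined elsewhere in this thread.

```python
def count_windows(articles, seq_length=100):
    """Total number of sliding-window samples across all articles."""
    return sum(max(0, len(a) - seq_length) for a in articles)


def steps_per_epoch(articles, seq_length=100, batch_size=32):
    """Number of full batches needed to see (almost) every sample once."""
    return count_windows(articles, seq_length) // batch_size


# Hypothetical usage with the names from this thread:
# steps = steps_per_epoch(articles, seq_length, batch_size)
# model.fit_generator(data_gen(batch_size), steps_per_epoch=steps, epochs=epochs)
```

In recent TensorFlow releases fit_generator is deprecated and model.fit accepts a generator directly, so the same arguments can be passed to model.fit instead.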

For a better-organized version, you can implement the Sequence class to encapsulate the data and act as the generator.
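A minimal sketch of what such a Sequence could look like is shown below. It keeps only the raw articles in memory and builds each batch on demand. The class name WindowSequence and its constructor arguments are illustrative, not part of the original answer, and the import fallback exists only so that the sketch also runs where TensorFlow is not installed.

```python
import bisect

import numpy as np

try:
    from tensorflow.keras.utils import Sequence
except ImportError:  # fallback so the sketch runs without TensorFlow
    Sequence = object


class WindowSequence(Sequence):
    """Serve sliding-window batches without materializing every window."""

    def __init__(self, articles, tokens, seq_length=100, batch_size=32):
        super().__init__()
        self.articles = articles
        self.tokens = tokens
        self.seq_length = seq_length
        self.batch_size = batch_size
        # offsets[k] = number of windows contained in articles[0..k-1]
        self.offsets = [0]
        for article in articles:
            n = max(0, len(article) - seq_length)
            self.offsets.append(self.offsets[-1] + n)

    def __len__(self):
        # Batches per epoch; the last batch may be smaller than batch_size.
        return (self.offsets[-1] + self.batch_size - 1) // self.batch_size

    def __getitem__(self, idx):
        start = idx * self.batch_size
        stop = min(start + self.batch_size, self.offsets[-1])
        X, y = [], []
        for n in range(start, stop):
            a = bisect.bisect_right(self.offsets, n) - 1  # article holding window n
            i = n - self.offsets[a]                       # window start inside it
            text = self.articles[a]
            X.append([self.tokens[c] for c in text[i:i + self.seq_length]])
            y.append(self.tokens[text[i + self.seq_length]])
        return np.array(X), np.array(y)
```

An instance can be passed to the training call in place of the generator. Since a Sequence reports its own length, no steps_per_epoch is needed, and Keras can shuffle the batch order between epochs.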
