Out-of-memory error when using a generator with fit_generator

Time: 2020-03-24 08:27:47

Tags: python tensorflow keras generator lstm

While training my LSTM network on 100,000 sentences (stored as a list) on Google Colab, my program crashed with an "Out of Memory" error.

To work around this, I created a generator function that splits the list of 100,000 sentences into 5 batches of 20,000 sentences each and trains on these batches with the Keras function model.fit_generator. However, while loading the second batch of 20,000 sentences, Google Colab crashes and restarts, again showing an "Out of Memory" error.

My model trains and predicts well when I train it on 20,000 sentences with model.fit.

During error analysis I found that the program crashes while building the np.array and the to_categorical output for the second batch of 20,000 sentences, while the earlier arrays from batch 1 are still held in memory.

For 20,000 sentences, it uses 13 GB of GPU RAM on Colab.
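That footprint is consistent with what to_categorical produces: a dense float32 one-hot matrix of shape (number of sequences, vocabulary size). A back-of-envelope sketch, using assumed numbers (roughly 400,000 four-word windows from 20,000 sentences and a vocabulary of about 8,000 words; both figures are guesses for illustration, not taken from the question):

# Hedged estimate of the one-hot target matrix size; the window count and
# vocabulary size below are assumptions, not measurements from the question.
n_windows = 400_000   # ~20 tokens/sentence * 20,000 sentences
vocab_size = 8_000    # hypothetical per-batch vocabulary
bytes_needed = n_windows * vocab_size * 4  # float32 = 4 bytes per entry
print(f"one-hot targets: {bytes_needed / 1e9:.1f} GB")  # ~12.8 GB

Under those assumptions a single batch's targets already approach 13 GB, so a second copy left over from batch 1 is enough to exhaust memory.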

Generator function:

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

def generator():
    start = 0.0
    end = 0.20
    while end <= 1.0:
        # Take the next 20% slice of the global sentence_list (5 batches total).
        sentences = sentence_list[int(len(sentence_list)*start):int(len(sentence_list)*end)]
        tokens = []

        print("Total number of sentences passed:", len(sentences))
        print("Sentences are:", sentences[:30])

        for sentence in sentences:
            tokens += sentence.split(" ")

        start = round(start + 0.20, 2)  # round to avoid float drift:
        end = round(end + 0.20, 2)      # 5 * 0.20 can exceed 1.0 and silently drop the last slice

        # Sliding windows of 3 input words plus 1 target word.
        train_len = 3 + 1
        text_sequences = []
        for i in range(train_len, len(tokens)):
            text_sequences.append(tokens[i-train_len:i])

        # This manual word-to-index map is overwritten by texts_to_sequences
        # below, so it has no effect on training.
        sequences = {}
        count = 1
        for i in range(len(tokens)):
            if tokens[i] not in sequences:
                sequences[tokens[i]] = count
                count += 1

        # Note: the Tokenizer is re-fit on every batch, so word indices are
        # not consistent across the 5 batches.
        tokenizer = Tokenizer()
        tokenizer.fit_on_texts(text_sequences)
        sequences = tokenizer.texts_to_sequences(text_sequences)

        # Collecting some information
        unique_words = tokenizer.index_word
        unique_wordsApp = tokenizer.word_counts
        vocabulary_size = len(tokenizer.word_counts)  # uncommented: used below, otherwise NameError

        print("vocabulary_size:", vocabulary_size)

        n_sequences = np.empty([len(sequences), train_len], dtype='float32')
        for i in range(len(sequences)):
            n_sequences[i] = sequences[i]

        train_inputs = n_sequences[:, :-1]
        train_targets = n_sequences[:, -1]

        print("train_targets =", train_targets)

        # The dense one-hot target matrix allocated here is where memory blows up.
        train_targets = to_categorical(train_targets, num_classes=vocabulary_size+1, dtype='float32')

        print("to_categorical done")

        seq_len = train_inputs.shape[1]

        yield train_inputs, train_targets

        # Resumes here only when fit_generator requests the next batch.
        train_inputs = None
        train_targets = None
        print("DONE")

Building and training the model:

from tensorflow.keras.callbacks import ModelCheckpoint
from pickle import dump

training_generator = generator()
checkpoint = ModelCheckpoint('one_lac_word_pred_Model4.h5', monitor='loss', verbose=1, save_best_only=True, mode='min')
print("DONE TILL CHECKPOINT")
model.fit_generator(generator=training_generator, steps_per_epoch=1, epochs=500, verbose=1, callbacks=[checkpoint], workers=0)
print("MODEL FITTING DONE")
model.save('one_lac_word_pred_Model4.h5')
# Note: tokenizer is local to generator(), so it is not defined here; this line raises a NameError as written.
dump(tokenizer, open('one_lac_tokenizer_Model4', 'wb'))
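Separately, fit_generator keeps requesting batches for steps_per_epoch * epochs steps (1 * 500 = 500 here), while generator() is exhausted after its five slices, so training would stop early even without the memory error. A minimal sketch of an endlessly cycling wrapper (cycling_generator is a made-up name for illustration):

def cycling_generator():
    # Restart the five-slice pass whenever it is exhausted, so
    # fit_generator can draw as many batches as it asks for.
    while True:
        for batch in generator():
            yield batch

training_generator = cycling_generator()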

I cannot find a solution: how can I keep generating batches while keeping memory available after each batch?

0 Answers:

There are no answers yet.