While training my LSTM network on Google Colab with a list of 100,000 sentences, my program crashes with an "Out of Memory" error.
To work around this, I created a generator function that splits the list of 100,000 sentences into 5 batches of 20,000 sentences each and trains on those batches via the Keras model.fit_generator function. However, while loading the second batch of 20,000 sentences, Google Colab crashes and restarts, again showing "Out of Memory Error".
When I train on 20,000 sentences with the model.fit function, my model predicts well.
During error analysis, I found that the program crashes while creating the np.array and the to_categorical output for the second batch of 20,000 sentences, while the arrays from batch 1 are still sitting in memory.
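A detail worth keeping in mind here: CPython frees a NumPy array only when no references to it remain, so rebinding a name (for example, setting it to None) releases nothing while another reference, such as one held by the training loop, is still alive. A small self-contained demonstration of explicitly dropping one batch before allocating the next (the array size here is arbitrary, roughly 80 MB):

import gc
import numpy as np

batch = np.zeros((20_000, 1_001), dtype='float32')  # ~80 MB; stands in for one batch
# ... train on the batch ...
del batch      # drop the reference so the buffer can be reclaimed
gc.collect()   # force a collection before allocating the next batch
next_batch = np.zeros((20_000, 1_001), dtype='float32')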
For 20,000 sentences, it uses 13 GB of GPU RAM on Colab.
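That footprint is plausible once you account for the one-hot targets: to_categorical returns a dense float32 matrix of shape (number of sequences, vocabulary_size + 1). A back-of-envelope estimate with assumed numbers (substitute your own sequence count and vocabulary size):

num_sequences = 300_000      # assumed: roughly 15 tokens per sentence x 20,000 sentences
vocabulary_size = 10_000     # assumed vocabulary size for one batch
bytes_per_float32 = 4

one_hot_bytes = num_sequences * (vocabulary_size + 1) * bytes_per_float32
print(one_hot_bytes / 1e9)   # ~12.0 GB for the target matrix alone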
Generator function:
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

def generator():
    start = 0
    end = 0.20
    while end <= 1:
        # Take the next 20% slice of the sentence list.
        sentences = sentence_list[int(len(sentence_list)*start):int(len(sentence_list)*end)]
        tokens = []
        print("Total Number of sentences passed : ", len(sentences))
        print("sentences are : ", sentences[:30])
        for sentence in sentences:
            words = sentence.split(" ")
            tokens += words
        start += 0.20
        end += 0.20
        # Sliding windows of 4 tokens: 3 input words + 1 target word.
        train_len = 3 + 1
        text_sequences = []
        for i in range(train_len, len(tokens)):
            seq = tokens[i-train_len:i]
            text_sequences.append(seq)
        # This mapping is built but immediately overwritten by
        # texts_to_sequences below.
        sequences = {}
        count = 1
        for i in range(len(tokens)):
            if tokens[i] not in sequences:
                sequences[tokens[i]] = count
                count += 1
        # NOTE: the Tokenizer is refit on every batch, so each batch gets
        # its own word-to-index mapping (see the example after this function).
        tokenizer = Tokenizer()
        tokenizer.fit_on_texts(text_sequences)
        sequences = tokenizer.texts_to_sequences(text_sequences)
        # Collecting some information
        unique_words = tokenizer.index_word
        unique_wordsApp = tokenizer.word_counts
        vocabulary_size = len(tokenizer.word_counts)
        print("vocabulary_size: ", vocabulary_size)
        n_sequences = np.empty([len(sequences), train_len], dtype='float32')
        for i in range(len(sequences)):
            n_sequences[i] = sequences[i]
        train_inputs = n_sequences[:, :-1]
        train_targets = n_sequences[:, -1]
        print("train_targets =", train_targets)
        # Dense one-hot encoding of the targets -- this is where memory peaks.
        train_targets = to_categorical(train_targets, num_classes=vocabulary_size+1, dtype='float32')
        print("to_categorical Done")
        seq_len = train_inputs.shape[1]
        yield train_inputs, train_targets
        # Attempt to release the batch after yielding it.
        train_inputs = None
        train_targets = None
    print("DONE")
Creating and training the model:
from keras.callbacks import ModelCheckpoint
from pickle import dump  # assuming dump here is pickle.dump

training_generator = generator()
# Save the best model (by training loss) after each epoch.
checkpoint = ModelCheckpoint('one_lac_word_pred_Model4.h5', monitor='loss', verbose=1, save_best_only=True, mode='min')
print("DONE TILL CHECKPOINT")
model.fit_generator(generator=training_generator, steps_per_epoch=1, epochs=500, verbose=1, callbacks=[checkpoint], workers=0)
print("MODEL FITTING DONE")
model.save('one_lac_word_pred_Model4.h5')
dump(tokenizer, open('one_lac_tokenizer_Model4', 'wb'))
I cannot find a solution: how do I keep generating batches while keeping memory free after each one?
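For reference, the Keras documentation states that a generator passed to fit_generator must loop over its data indefinitely; the generator above stops after its fifth slice, so it would run dry in any case. A minimal sketch combining that endless loop with explicit per-batch cleanup; build_batch is a hypothetical helper standing in for the tokenize/to_categorical steps shown above:

import gc

def looping_generator():
    while True:  # fit_generator expects the generator to yield forever
        for start in (0.0, 0.2, 0.4, 0.6, 0.8):
            lo = int(len(sentence_list) * start)
            hi = int(len(sentence_list) * (start + 0.2))
            # build_batch is a hypothetical helper wrapping the
            # tokenization and to_categorical steps shown above.
            train_inputs, train_targets = build_batch(sentence_list[lo:hi])
            yield train_inputs, train_targets
            # Drop the generator's own references and force a collection
            # before building the next batch.
            del train_inputs, train_targets
            gc.collect()

Alternatively, compiling the model with the sparse_categorical_crossentropy loss lets you pass the integer targets directly and skip the dense to_categorical matrix altogether.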