My dataset is too large to fit in RAM. My assumption was that I could split the dataset into large chunks, train the model on them iteratively, and see accuracy results similar to training on the unsplit dataset (as described in #107). So I tested this in two steps on a sample set of 146244 elements.
Here is the code that creates and fits the model:
import h5py
from keras import optimizers
from keras.models import Sequential, load_model
from keras.layers import Flatten, Dense, BatchNormalization, Dropout

model = Sequential()
model.add(Flatten(input_shape=train_data.shape[1:]))
model.add(Dense(256, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='sigmoid'))

optimizer = optimizers.Adam(lr=0.0001)
model.compile(optimizer=optimizer,
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.save(top_model_path)

# Re-load the saved model before each chunk so training continues
# from the weights produced by the previous chunk, then evaluate.
for train_data, train_labels in hdf5_generator('D:/_g/kaggle/cdisco/files/features_xception_1_100_90x90_tr1.hdf5', 'train', train_labels):
    model = load_model(top_model_path)
    model.fit(train_data, train_labels, epochs=2, batch_size=batch_size)
    model.save(top_model_path)

    (eval_loss, eval_accuracy) = model.evaluate(validation_data, validation_labels, batch_size=batch_size, verbose=1)
    print("Accuracy: {:.2f}%".format(eval_accuracy * 100))
    print("Loss: {}\n".format(eval_loss))
And here is the generator code:
def hdf5_generator(file_name, dataset, labels):
    hf = h5py.File(file_name, 'r')
    i = 0
    batch_size = 50000  # chunk size: 50000 elements per yield
    while 1:
        start = i * batch_size
        end = (i + 1) * batch_size
        print("start: %d -- end: %d" % (start, end))
        yield (hf[dataset][start:end], labels[start:end])
        if end >= len(hf[dataset]):
            break
        i += 1
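For reference, this is how I expect the generator to slice the sample set (an illustrative sanity check, not part of the training script above):
# Illustrative sanity check of the chunk boundaries (not part of the training
# run): the 146244-element set should come out as 50000 + 50000 + 46244.
gen = hdf5_generator('D:/_g/kaggle/cdisco/files/features_xception_1_100_90x90_tr1.hdf5',
                     'train', train_labels)
for data, labels in gen:
    print(data.shape[0], labels.shape[0])  # expected: 50000, 50000, 46244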
The first attempt, with the chunked data, produced unstable accuracy across the steps (22.90%, 59.93%, 51.17%):
start: 0 -- end: 50000
Epoch 1/2
50000/50000 [==============================] - 39s - loss: 2.4969 - acc: 0.5143
Epoch 2/2
50000/50000 [==============================] - 38s - loss: 1.5667 - acc: 0.6201
16156/16156 [==============================] - 33s
Accuracy: 22.90%
Loss: 5.976185762991436
start: 50000 -- end: 100000
Epoch 1/2
50000/50000 [==============================] - 38s - loss: 1.3759 - acc: 0.7211
Epoch 2/2
50000/50000 [==============================] - 39s - loss: 0.5446 - acc: 0.8621
16156/16156 [==============================] - 34s
Accuracy: 59.93%
Loss: 2.540657121840312
start: 100000 -- end: 150000
Epoch 1/2
46244/46244 [==============================] - 36s - loss: 0.2640 - acc: 0.9531
Epoch 2/2
46244/46244 [==============================] - 36s - loss: 0.1283 - acc: 0.9672
16156/16156 [==============================] - 34s
Accuracy: 51.17%
Loss: 3.8107337748964336
The second attempt (with batch_size = 146244 in hdf5_generator, i.e. the whole sample set as a single chunk) reached 77.49% accuracy:
start: 0 -- end: 146244
Epoch 1/2
146244/146244 [==============================] - 112s - loss: 1.8089 - acc: 0.6123
Epoch 2/2
146244/146244 [==============================] - 113s - loss: 1.0966 - acc: 0.7265
16156/16156 [==============================] - 39s
Accuracy: 77.49%
Loss: 0.8401890944788202
I expected to see similar accuracy results from both runs. Instead they differ substantially, and in the first run it looks as if the model loses its learned parameters between iterations. How can I make repeated fitting on a chunked dataset produce results similar to a single fit on the full data?
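For completeness, this is the direction I am considering instead: a minimal sketch (the helper name hdf5_batch_generator and the batch size of 256 are my own choices, not tested) that lets Keras stream the whole HDF5 file in small batches within each epoch via fit_generator, rather than re-fitting the model chunk by chunk:
import numpy as np

# Minimal sketch (hypothetical helper, assumed batch size of 256): yield small
# batches endlessly so fit_generator can iterate over the whole file each epoch.
def hdf5_batch_generator(file_name, dataset, labels, batch_size=256):
    hf = h5py.File(file_name, 'r')
    n = len(hf[dataset])
    while True:  # fit_generator expects a generator that never terminates
        for start in range(0, n, batch_size):
            end = min(start + batch_size, n)
            yield hf[dataset][start:end], labels[start:end]

steps = int(np.ceil(len(train_labels) / 256.0))
model.fit_generator(hdf5_batch_generator('D:/_g/kaggle/cdisco/files/features_xception_1_100_90x90_tr1.hdf5',
                                         'train', train_labels),
                    steps_per_epoch=steps, epochs=2)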
I have also tried HDF5Matrix, but its performance was very poor.
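Roughly, that attempt looked like this (a sketch from memory; with HDF5Matrix, Keras only supports shuffle='batch' or no shuffling, and each epoch was far slower than fitting the in-memory chunks above):
from keras.utils.io_utils import HDF5Matrix

# Sketch of the HDF5Matrix attempt (same file as above). 'batch' is the special
# shuffle mode Keras offers for HDF5 data, since fully random shuffling would
# require arbitrary random access into the file.
X_train = HDF5Matrix('D:/_g/kaggle/cdisco/files/features_xception_1_100_90x90_tr1.hdf5', 'train')
model.fit(X_train, train_labels, epochs=2, batch_size=batch_size, shuffle='batch')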