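I read somewhere that you should not use data augmentation on the validation set, only on the training set.
My problem is this:
I have a dataset with a small number of training samples, and I want to use data augmentation. I split the dataset into a training set and a test set and apply data augmentation to the training set. I then use StratifiedKFold on the training set, which returns train indices and test indices. However, if I use X_train[test_index] as my validation set, it contains some augmented images, which I do not want.
Is there a way to combine data augmentation with cross-validation on the training set so that the validation folds contain only original images?
Here is my code (I have not added the data augmentation yet, but I would love to find a way to keep test_index separate from the augmented training samples):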
import gc

import tensorflow as tf
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# X_train, y_train, X_test, y_test, deep_neural, get_model_name, csv_logger,
# BATCH_SIZE, steps_per_epoch and the *_per_fold lists are defined elsewhere.
kfold = StratifiedKFold(n_splits=5, shuffle=True)
for i, (train_index, test_index) in enumerate(kfold.split(X_train, y_train), start=1):
    # Training fold: shuffle, batch and repeat indefinitely
    dataset_train = tf.data.Dataset.from_tensor_slices(
        (X_train[train_index], y_train.iloc[train_index])
    ).shuffle(len(train_index))
    dataset_train = dataset_train.batch(512, drop_remainder=True).repeat()

    # Validation fold: this is the part that must not contain augmented images
    dataset_test = tf.data.Dataset.from_tensor_slices(
        (X_train[test_index], y_train.iloc[test_index])
    ).shuffle(len(test_index))
    dataset_test = dataset_test.batch(32, drop_remainder=True).take(steps_per_epoch).repeat()

    model_1 = deep_neural()
    print('-' * 120)
    print('\n')
    print(f'Training for fold {i} ...')
    print('Training on {} samples.........Validating on {} samples'.format(
        len(train_index), len(test_index)))

    # Keep only the best weights of this fold, judged by validation loss
    checkpoint = tf.keras.callbacks.ModelCheckpoint(
        get_model_name(i), monitor='val_loss', verbose=1,
        save_best_only=True, mode='min')
    history = model_1.fit(
        dataset_train, steps_per_epoch=len(train_index) // BATCH_SIZE,
        epochs=4, validation_data=dataset_test,
        validation_steps=1, callbacks=[csv_logger, checkpoint])

    # Evaluate the fold's model on the held-out test set
    scores = model_1.evaluate(X_test, y_test, verbose=0)
    pred_classes = model_1.predict(X_test).argmax(1)
    f1score = f1_score(y_test, pred_classes, average='macro')
    print('\n')
    print(f'Score for fold {i}: {model_1.metrics_names[0]} of {scores[0]}; '
          f'{model_1.metrics_names[1]} of {scores[1] * 100}%; F1 Score of {f1score}')
    print('\n')
    acc_per_fold.append(scores[1] * 100)
    loss_per_fold.append(scores[0])
    f1score_per_fold.append(f1score)

    # Drop the model reference first so the memory can actually be reclaimed
    del model_1
    tf.keras.backend.clear_session()
    gc.collect()
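
To make the goal concrete, here is a minimal sketch of the separation I am after: split the original (un-augmented) images first, then create augmented copies from the training indices of each fold only, so the validation indices always point at original images. It is an illustration under assumptions, not a tested solution for my pipeline: `augment` and `folds_with_train_only_augmentation` are hypothetical names, the flip-plus-noise augmentation is only a placeholder, and the labels are assumed to be a NumPy array rather than a pandas Series.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def augment(images, rng):
    """Placeholder augmentation (assumption): horizontal flip plus small noise.

    Expects images shaped (N, H, W) or (N, H, W, C); axis 2 is the width.
    """
    flipped = images[:, :, ::-1]
    noise = rng.normal(0.0, 0.01, size=images.shape)
    return flipped + noise


def folds_with_train_only_augmentation(X, y, n_splits=5, seed=0):
    """Yield (X_tr, y_tr, X_val, y_val); only the training part is augmented."""
    rng = np.random.default_rng(seed)
    kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Split the ORIGINAL data, so validation indices never see augmented copies.
    for train_index, val_index in kfold.split(X, y):
        X_tr, y_tr = X[train_index], y[train_index]
        # Augmented copies are derived from this fold's training images only.
        X_aug = augment(X_tr, rng)
        X_tr_full = np.concatenate([X_tr, X_aug])
        y_tr_full = np.concatenate([y_tr, y_tr])
        # The validation fold is returned untouched.
        yield X_tr_full, y_tr_full, X[val_index], y[val_index]


if __name__ == "__main__":
    # Toy data standing in for the real images: 20 samples, 2 balanced classes.
    X = np.random.default_rng(1).random((20, 8, 8))
    y = np.array([0, 1] * 10)
    for fold, (X_tr, y_tr, X_val, y_val) in enumerate(
            folds_with_train_only_augmentation(X, y), start=1):
        print(f"fold {fold}: train {X_tr.shape}, validation {X_val.shape}")
```

In the tf.data version above, the same idea would presumably mean applying the augmentation to `dataset_train` only (for example through `dataset_train.map(...)` before batching) and leaving `dataset_test` as it is.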