How do I do data augmentation and cross-validation at the same time in NLP?

Asked: 2020-08-12 18:28:59

Tags: python tensorflow machine-learning nlp cross-validation

I read somewhere that you should not apply data augmentation to the validation set, only to the training set.

My problem is this:

I have a dataset with a small number of training samples, so I want to use data augmentation. I split the dataset into a training set and a test set and apply augmentation to the training set. Then I run StratifiedKFold on that training set, which gives me a train index and a test index for each fold; but if I use X_train[test_index] as my validation set, it contains some of the augmented images, which I don't want.

Is there a way to do data augmentation and cross-validation on the training set together?
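To make it concrete, this is the structure I think I need (a rough, untested sketch; augment_samples is just a placeholder for whatever augmentation function I end up writing): split first, then augment only the training portion of each fold, so the validation indices always point at original samples.

from sklearn.model_selection import StratifiedKFold

kfold = StratifiedKFold(n_splits=5, shuffle=True)
for train_index, val_index in kfold.split(X_train, y_train):
  # Augment only this fold's training portion. augment_samples is hypothetical:
  # it would return the original samples plus augmented copies, with matching labels.
  X_fold_train, y_fold_train = augment_samples(X_train[train_index],
                                               y_train.iloc[train_index].values)
  # The validation fold is left untouched, so it only contains original samples.
  X_fold_val = X_train[val_index]
  y_fold_val = y_train.iloc[val_index].values
  # ...then build the tf.data datasets and train as in my code below...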

Here is my code (I haven't added the data augmentation yet, but I would love to find a way to keep test_index separate from the augmented training samples):

import gc

import tensorflow as tf
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

# deep_neural, get_model_name, csv_logger, BATCH_SIZE and steps_per_epoch
# are defined elsewhere in my script.
acc_per_fold, loss_per_fold, f1score_per_fold = [], [], []

kfold = StratifiedKFold(n_splits=5, shuffle=True)
i = 1
for train_index, test_index in kfold.split(X_train, y_train):

  # Build the training and validation datasets for this fold.
  dataset_train = tf.data.Dataset.from_tensor_slices(
      (X_train[train_index], y_train.iloc[train_index])).shuffle(len(X_train[train_index]))
  dataset_train = dataset_train.batch(512, drop_remainder=True).repeat()
  dataset_test = tf.data.Dataset.from_tensor_slices(
      (X_train[test_index], y_train.iloc[test_index])).shuffle(len(X_train[test_index]))
  dataset_test = dataset_test.batch(32, drop_remainder=True).take(steps_per_epoch).repeat()

  model_1 = deep_neural()

  print('-' * 110)
  print('\n')
  print(f'Training for fold {i} ...')
  print('Training on {} samples.........Validating on {} samples'.format(
      len(X_train[train_index]), len(X_train[test_index])))
  checkpoint = tf.keras.callbacks.ModelCheckpoint(get_model_name(i),
                                                  monitor='val_loss', verbose=1,
                                                  save_best_only=True, mode='min')

  history = model_1.fit(dataset_train, steps_per_epoch=len(X_train[train_index]) // BATCH_SIZE,
                        epochs=4, validation_data=dataset_test,
                        validation_steps=1, callbacks=[csv_logger, checkpoint])

  # Evaluate this fold's model on the held-out test set.
  scores = model_1.evaluate(X_test, y_test, verbose=0)
  pred_classes = model_1.predict(X_test).argmax(1)
  f1score = f1_score(y_test, pred_classes, average='macro')

  print('\n')
  print(f'Score for fold {i}: {model_1.metrics_names[0]} of {scores[0]}; '
        f'{model_1.metrics_names[1]} of {scores[1]*100}; F1 Score of {f1score}%')
  print('\n')
  acc_per_fold.append(scores[1] * 100)
  loss_per_fold.append(scores[0])
  f1score_per_fold.append(f1score)

  # Free memory before the next fold.
  tf.keras.backend.clear_session()
  gc.collect()
  del model_1
  i = i + 1
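Alternatively, would it be valid to keep the arrays un-augmented and apply the augmentation on the fly, inside the fold loop, with a .map(...) on dataset_train only, so that dataset_test never sees augmented samples? A rough sketch of what I mean (untested), using a couple of tf.image ops as stand-ins for whatever augmentation I end up using, and assuming X_train holds image tensors:

def augment(image, label):
  # Stand-in augmentation; applied only to the training fold's pipeline.
  image = tf.image.random_flip_left_right(image)
  image = tf.image.random_brightness(image, max_delta=0.1)
  return image, label

dataset_train = tf.data.Dataset.from_tensor_slices(
    (X_train[train_index], y_train.iloc[train_index]))
dataset_train = (dataset_train
                 .shuffle(len(train_index))
                 .map(augment, num_parallel_calls=tf.data.experimental.AUTOTUNE)
                 .batch(512, drop_remainder=True)
                 .repeat())
# dataset_test would be built exactly as before, with no .map(augment) step.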

0 Answers:

No answers yet