Training loss is decreasing, but accuracy stays the same

Time: 2021-01-17 09:01:10

Tags: deep-learning nlp pytorch classification bert-language-model

This is the training and evaluation code for a multi-label classification task using RoBERTa (a BERT variant). The first part is training and the second part is development (validation). `train_dataloader` holds my training dataset and `dev_dataloader` the development dataset. My question is: why does the training loss decrease step by step while the accuracy barely increases? In fact, the accuracy keeps increasing only until iteration 4, whereas the training loss keeps decreasing until the last epoch (iteration). Is this fine, or is something wrong?

import torch
from torch.nn import BCEWithLogitsLoss
from tqdm import trange
from sklearn.metrics import jaccard_score

train_loss_set = []
iterate = 4
for _ in trange(iterate, desc="Iterate"):
  model.train()

  train_loss = 0 
  nu_train_examples, nu_train_steps = 0, 0
  
  loss_function = BCEWithLogitsLoss()  # construct once, outside the batch loop
  for step, batch in enumerate(train_dataloader):
    batch = tuple(t.to(device) for t in batch)
    batch_input_ids, batch_input_mask, batch_labels = batch
    optimizer.zero_grad()
    output = model(batch_input_ids, attention_mask=batch_input_mask)
    logits = output[0]
    loss = loss_function(logits.view(-1, num_labels),
                         batch_labels.type_as(logits).view(-1, num_labels))
    train_loss_set.append(loss.item())    
    loss.backward()
    optimizer.step()
    train_loss += loss.item()
    nu_train_examples += batch_input_ids.size(0)
    nu_train_steps += 1

  print("Train loss: {}".format(train_loss/nu_train_steps))

###############################################################################

  model.eval()
  logits_pred,true_labels,pred_labels,tokenized_texts = [],[],[],[]

  # Predict
  for i, batch in enumerate(dev_dataloader):
    batch = tuple(t.to(device) for t in batch)
    batch_input_ids, batch_input_mask, batch_labels = batch
    with torch.no_grad():
      out = model(batch_input_ids, attention_mask=batch_input_mask)
      batch_logit_pred = out[0]
      pred_label = torch.sigmoid(batch_logit_pred)
      batch_logit_pred = batch_logit_pred.detach().cpu().numpy()
      pred_label = pred_label.to('cpu').numpy()
      batch_labels = batch_labels.to('cpu').numpy()

    tokenized_texts.append(batch_input_ids)
    logits_pred.append(batch_logit_pred)
    true_labels.append(batch_labels)
    pred_labels.append(pred_label)

  pred_labels = [item for sublist in pred_labels for item in sublist]
  true_labels = [item for sublist in true_labels for item in sublist]
  threshold = 0.4
  pred_bools = [pl>threshold for pl in pred_labels]
  true_bools = [tl==1 for tl in true_labels]
  
  print("Accuracy is: ", jaccard_score(true_bools,pred_bools,average='samples'))
torch.save(model.state_dict(), 'bert_model')
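The evaluation above thresholds the sigmoid probabilities at 0.4 and averages a per-sample Jaccard index. To see how that metric behaves, here is a minimal NumPy sketch on toy data (the `sample_jaccard` helper and the numbers are illustrative, not from the post; the original uses sklearn's `jaccard_score` with `average='samples'`). It also reproduces the edge case behind the `UndefinedMetricWarning` in the output below: a sample with no true and no predicted labels has an empty union and gets scored 0.

```python
import numpy as np

def sample_jaccard(true_bools, pred_bools):
    """Per-sample Jaccard: |intersection| / |union| of the label sets,
    averaged over samples. An empty union scores 0 (the warning case)."""
    scores = []
    for t, p in zip(true_bools, pred_bools):
        union = np.logical_or(t, p).sum()
        inter = np.logical_and(t, p).sum()
        scores.append(inter / union if union > 0 else 0.0)
    return float(np.mean(scores))

# toy sigmoid outputs for 3 samples x 4 labels, thresholded at 0.4
probs = np.array([[0.9, 0.1, 0.5, 0.2],
                  [0.3, 0.8, 0.1, 0.1],
                  [0.2, 0.1, 0.3, 0.1]])  # last sample predicts no labels
pred = probs > 0.4
true = np.array([[1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [0, 0, 0, 0]], dtype=bool)

print(sample_jaccard(true, pred))  # → 0.5 (scores 1.0, 0.5, 0.0)
```

Samples where the model predicts nothing and the ground truth is also empty therefore drag the average down rather than counting as correct, which is what the warning is pointing at.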

And the output:

Iterate:   0%|          | 0/10 [00:00<?, ?it/s]

Train loss: 0.4024542534684801

/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Jaccard is ill-defined and being set to 0.0 in samples with no true or predicted labels. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

Accuracy is:  0.5806403013182674

Iterate:  10%|█         | 1/10 [03:21<30:14, 201.64s/it]

Train loss: 0.2972540049911379
Accuracy is:  0.6091337099811676

Iterate:  20%|██        | 2/10 [06:49<27:07, 203.49s/it]

Train loss: 0.26178574864264137
Accuracy is:  0.608361581920904

Iterate:  30%|███       | 3/10 [10:17<23:53, 204.78s/it]

Train loss: 0.23612180122962365
Accuracy is:  0.6096717783158462

Iterate:  40%|████      | 4/10 [13:44<20:33, 205.66s/it]

Train loss: 0.21416303515434265
Accuracy is:  0.6046892655367231

Iterate:  50%|█████     | 5/10 [17:12<17:11, 206.27s/it]

Train loss: 0.1929110718982203
Accuracy is:  0.6030885122410546

Iterate:  60%|██████    | 6/10 [20:40<13:46, 206.74s/it]

Train loss: 0.17280191068465894
Accuracy is:  0.6003766478342749

Iterate:  70%|███████   | 7/10 [24:08<10:21, 207.04s/it]

Train loss: 0.1517329115446631
Accuracy is:  0.5864783427495291

Iterate:  80%|████████  | 8/10 [27:35<06:54, 207.23s/it]

Train loss: 0.12957811209705325
Accuracy is:  0.5818832391713747

Iterate:  90%|█████████ | 9/10 [31:03<03:27, 207.39s/it]

Train loss: 0.11256680189521162
Accuracy is:  0.5796045197740114

Iterate: 100%|██████████| 10/10 [34:31<00:00, 207.14s/it]

1 answer:

Answer 0: (score: 0)

The training loss is decreasing because your model is gradually learning your training set. The evaluation accuracy measures how well the model has captured the general features of the data, i.e., how well it predicts unseen data. So if the loss is decreasing, your model is indeed learning. But it may have learned information that is too specific to the training set and is in fact overfitting: it fits the training data "too well" and can no longer make correct predictions on unseen data, which may differ slightly. That is why the evaluation accuracy stops improving.

That is one possible explanation.