I am trying to build a multi-task deep neural network on top of an XLM-RoBERTa large model to solve a multilingual classification problem. My training dataset contains 4 columns:
ID
comment_text (each user's English comment is stored in this column, keyed by ID; example comment: "you are a loser")
toxic (this column contains 1/0, where 0 means non-toxic and 1 means toxic)
personal_attack (this column also contains 0/1, where 0 means the comment is not a personal attack and 1 means it is)
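For reference, this is roughly how I load and inspect the data (the file name train.csv is a placeholder for my actual file; note that my code below accesses the personal-attack column under the name identity_attack):

import pandas as pd

# 'train.csv' is a placeholder path; the tf.data pipeline further down
# reads the personal-attack labels as train.identity_attack
train = pd.read_csv('train.csv')
print(train.head())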
Here is my model code:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

def build_model(transformer, max_len=512):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    # transformer(...)[0] is the last hidden state, shape (batch, max_len, hidden)
    sequence_output = transformer(input_word_ids)[0]
    # take the hidden state of the first (<s>/CLS) token as the sentence representation
    cls_token = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid', name='y_train')(cls_token)  # toxic head
    out1 = Dense(1, activation='sigmoid', name='y_aux')(cls_token)   # identity_attack head
    model = Model(inputs=input_word_ids, outputs=[out, out1])
    model.compile(Adam(lr=1e-5), loss='binary_crossentropy', metrics=['accuracy'])
    return model
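For context, I instantiate it roughly like this (the checkpoint name is illustrative; I just need a TF-compatible XLM-RoBERTa large checkpoint):

from transformers import TFAutoModel

# illustrative checkpoint; any TF XLM-RoBERTa large checkpoint works the same way
transformer_layer = TFAutoModel.from_pretrained('jplu/tf-xlm-roberta-large')
model = build_model(transformer_layer)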
Here is the code for the training and test datasets:
train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_train,
                         # the dict keys must match the output layer names in build_model
                         {'y_train': train.toxic.values,
                          'y_aux': train.identity_attack.values}))
    .repeat()
    .shuffle(2048)
    .batch(BATCH_SIZE)
    .prefetch(AUTO)  # AUTO = tf.data.experimental.AUTOTUNE
)

test_dataset = (
    tf.data.Dataset
    .from_tensor_slices(x_test)
    .batch(BATCH_SIZE)
)
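For completeness, x_train and x_test above are padded token-ID matrices that I build with the Hugging Face tokenizer, roughly like this (regular_encode is just a helper name I chose, test is my test dataframe, and I assume a transformers version that supports padding='max_length'):

import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-large')

def regular_encode(texts, tokenizer, maxlen=512):
    # pad/truncate every comment to maxlen and keep only the input IDs,
    # since build_model consumes a single token-ID tensor
    enc = tokenizer.batch_encode_plus(
        texts,
        return_attention_mask=False,
        return_token_type_ids=False,
        padding='max_length',
        truncation=True,
        max_length=maxlen,
    )
    return np.array(enc['input_ids'])

x_train = regular_encode(train.comment_text.values.tolist(), tokenizer)
x_test = regular_encode(test.comment_text.values.tolist(), tokenizer)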
Then I train the model with the following code:
EPOCHS = 3
n_steps = x_train.shape[0] // BATCH_SIZE

train_history = model.fit(
    train_dataset,
    steps_per_epoch=n_steps,
    epochs=EPOCHS
)
I do not want to run validation, so I only passed train_dataset to model.fit(). After 3 epochs my results look like this:
Epoch 3/3
1658/1658 [==============================] - 887s 535ms/step - loss: 0.0591 - y_train_loss: 0.0175 - y_aux_loss: 0.0416 - y_train_accuracy: 0.9940 - y_aux_accuracy: 0.9821
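As a sanity check on these numbers: since I compiled with a single binary_crossentropy loss and no loss_weights, Keras sums the per-output losses, and indeed 0.0175 + 0.0416 ≈ 0.0591, the reported total loss.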
Now my test set has only one column, the comment text, and I want the model to predict whether each test-set comment is toxic. As the epoch-3 results show, I am tracking both y_train_accuracy: 0.9940 and y_aux_accuracy: 0.9821, but at prediction time I only want the toxic/non-toxic output (y_test). To do this, I tried:
sub['toxic'] = model.predict(test_dataset, verbose=1)
sub is a dataframe that contains all the test-set IDs, and with test_dataset I tried to get a prediction for every test comment, but I got this error:
499/499 [==============================] - 126s 253ms/step
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-23-1dc84858379e> in <module>
----> 1 sub['toxic'] = model.predict(test_dataset, verbose=1)
2 sub.to_csv('submission.csv', index=False)
/opt/conda/lib/python3.7/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
2936 else:
2937 # set column
-> 2938 self._set_item(key, value)
2939
2940 def _setitem_slice(self, key, value):
/opt/conda/lib/python3.7/site-packages/pandas/core/frame.py in _set_item(self, key, value)
2998
2999 self._ensure_valid_index(value)
-> 3000 value = self._sanitize_column(key, value)
3001 NDFrame._set_item(self, key, value)
3002
/opt/conda/lib/python3.7/site-packages/pandas/core/frame.py in _sanitize_column(self, key, value, broadcast)
3634
3635 # turn me into an ndarray
-> 3636 value = sanitize_index(value, self.index, copy=False)
3637 if not isinstance(value, (np.ndarray, Index)):
3638 if isinstance(value, list) and len(value) > 0:
/opt/conda/lib/python3.7/site-packages/pandas/core/internals/construction.py in sanitize_index(data, index, copy)
609
610 if len(data) != len(index):
--> 611 raise ValueError("Length of values does not match length of index")
612
613 if isinstance(data, ABCIndexClass) and not copy:
ValueError: Length of values does not match length of index
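I suspect that predict on a two-output model returns a list of two arrays (one per head) rather than a single array, which pandas cannot assign to one column; a quick check like this should show it:

preds = model.predict(test_dataset, verbose=1)
# for a two-output model, preds should be a list:
# [y_train predictions, y_aux predictions], each of shape (n_samples, 1)
print(type(preds), len(preds), preds[0].shape, preds[1].shape)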
Now I have 4 questions:

1. Is my implementation correct?
2. Why does this error occur? If I treat this as a plain single-task multilingual classification problem, i.e. compute a single loss for y, I get no error at all, so where am I going wrong?
3. How can I fix it?
4. Since this is my first time doing multi-task learning with Hugging Face Transformers, do you have any suggestions for improving my model?