I am trying to pretrain BERT on a dataset of 150k sentences (wiki103). After 12 epochs, the NSP (next sentence prediction) task reaches about 0.76 accuracy (it overfits if I keep training for more epochs), while the MLM (masked language modeling) task starts at 0.01 accuracy and tops out around 0.2. What is going wrong here? Can I stop NSP at some point and keep training on MLM alone for longer? My train loader has length 2486 (2486 training steps per epoch), which means 40*2486 = 99440 training steps in total.
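The step arithmetic above can be checked quickly (numbers taken from the question; `steps_per_epoch` and `warmup_steps` are just illustrative variable names):

```python
steps_per_epoch = 2486           # length of the train loader
n_epochs = 40
total_steps = steps_per_epoch * n_epochs   # 2486 * 40 = 99440
warmup_steps = int(0.1 * total_steps)      # 10% warm-up = 9944 steps
print(total_steps, warmup_steps)
```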
Here are the model and training configurations:
class Train_Config():
    """ Hyperparameters for training """
    seed: int = 391275     # random seed
    batch_size: int = 64
    lr: float = 1e-5       # learning rate
    n_epochs: int = 40     # number of epochs
    # `warm up` period = warmup (0.1) * total_steps,
    # linearly increasing the learning rate from zero to the specified value (`lr`)
    warmup: float = 0.1
    is_dibert: bool = False
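The warm-up comment in `Train_Config` describes the usual BERT schedule: the learning rate rises linearly from zero for the first `warmup * total_steps` steps, then (typically) decays linearly back to zero. A minimal sketch of that schedule, assuming linear decay after warm-up (the function name `lr_at_step` is my own, not from the question's code):

```python
def lr_at_step(step, total_steps, base_lr=1e-5, warmup=0.1):
    """Linear warm-up from 0 to base_lr, then linear decay back to 0."""
    warmup_steps = int(warmup * total_steps)
    if step < warmup_steps:
        # warm-up phase: ramp up proportionally to the current step
        return base_lr * step / max(1, warmup_steps)
    # decay phase: ramp down over the remaining steps
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```

With the numbers from the question (99440 total steps, 10% warm-up), the peak learning rate of 1e-5 is reached at step 9944.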
class Model_Config():
    vocab_size: int = 30522           # Size of Vocabulary
    hidden_size: int = 768            # Dimension of Hidden Layer in Transformer Encoder
    num_hidden_layers: int = 8        # Number of Hidden Layers
    num_attention_heads: int = 8      # Number of Heads in Multi-Headed Attention Layers
    intermediate_size: int = 768 * 4  # Dimension of Intermediate Layers in Positionwise Feedforward Net
    # activ_fn: str = "gelu"          # Non-linear Activation Function Type in Hidden Layers
    max_len: int = 312                # Maximum Length for Positional Embeddings
    n_segments: int = 2               # Number of Sentence Segments
    attention_probs_dropout_prob: float = 0.1