First, I create the tokenizer as follows:
from tokenizers import Tokenizer
from tokenizers.models import BPE,WordPiece
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
from tokenizers.trainers import BpeTrainer,WordPieceTrainer
trainer = WordPieceTrainer(
    vocab_size=5000,
    min_frequency=3,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
from tokenizers.pre_tokenizers import Whitespace,WhitespaceSplit
tokenizer.pre_tokenizer = WhitespaceSplit()
tokenizer.train(files, trainer)
from tokenizers.processors import TemplateProcessing
tokenizer.token_to_id("[SEP]"),tokenizer.token_to_id("[CLS]")
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
Next, I want to train a BERT model on these tokens. I tried the following:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,mlm=True, mlm_probability=0.15)
But it gives me an error:
AttributeError: 'tokenizers.Tokenizer' object has no attribute 'mask_token'
"This tokenizer does not have a mask token which is necessary for masked language modeling."
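The AttributeError suggests that DataCollatorForLanguageModeling expects a transformers tokenizer exposing a `mask_token` attribute, which a raw `tokenizers.Tokenizer` does not have. One possible approach, sketched below, is to wrap the trained tokenizer in `PreTrainedTokenizerFast` and name the special tokens explicitly. The tiny in-memory `corpus` and the use of `train_from_iterator` are assumptions made so the sketch runs on its own; they stand in for the `files` used above.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import WhitespaceSplit
from tokenizers.trainers import WordPieceTrainer
from transformers import DataCollatorForLanguageModeling, PreTrainedTokenizerFast

# Hypothetical stand-in for the training files from the question.
corpus = ["the cat sat on the mat", "the dog sat on the log"] * 5

# Same tokenizer setup as in the question, trained from an iterator
# instead of from files so that the example is self-contained.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = WhitespaceSplit()
trainer = WordPieceTrainer(
    vocab_size=5000,
    min_frequency=3,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    show_progress=False,
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Wrap the raw tokenizer so that transformers knows which tokens
# play the special roles (including the mask token).
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    pad_token="[PAD]",
    mask_token="[MASK]",
)

# The collator can now find fast_tokenizer.mask_token.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=fast_tokenizer, mlm=True, mlm_probability=0.15
)
```

The special-token strings passed to the wrapper have to match the ones given to the trainer, otherwise their ids will not resolve to the trained vocabulary entries.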
I do have an attention_mask, though. Is that different from the mask token?
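To illustrate the distinction being asked about, here is a minimal pure-Python sketch. It assumes the usual conventions: the attention mask only flags which positions are real tokens (1) versus padding (0), while the mask token is a vocabulary item that replaces some input tokens so the model can be trained to predict them. The helper names are hypothetical, and the masking is simplified: it always substitutes `[MASK]`, whereas the real collator also sometimes keeps the token or swaps in a random one.

```python
import random

PAD, MASK = "[PAD]", "[MASK]"


def pad_with_attention_mask(tokens, length):
    """Pad tokens to `length`; attention mask is 1 for real tokens, 0 for padding."""
    n_pad = length - len(tokens)
    return tokens + [PAD] * n_pad, [1] * len(tokens) + [0] * n_pad


def mask_for_mlm(tokens, special, prob, rng):
    """Replace a random subset of non-special tokens with [MASK].

    Returns (masked_tokens, labels), where labels hold the original token
    at masked positions and None everywhere else.
    """
    masked, labels = [], []
    for tok in tokens:
        if tok not in special and rng.random() < prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


tokens = ["[CLS]", "the", "cat", "sat", "[SEP]"]
special = {"[CLS]", "[SEP]", PAD}

# attention_mask describes padding only; no token is hidden by it.
padded, attention_mask = pad_with_attention_mask(tokens, 8)

# The mask token changes the input itself and defines the prediction targets.
masked, labels = mask_for_mlm(padded, special, prob=0.5, rng=random.Random(0))
```

So the two are independent: a sequence always has an attention mask once it is padded, but `[MASK]` tokens only appear when the data collator corrupts the input for masked language modeling.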