HuggingFace BertWordPieceTokenizer vs. BertTokenizer

Time: 2020-06-16 09:19:39

Tags: nlp huggingface-transformers bert-language-model huggingface-tokenizers

I have the following code snippets and am trying to understand the difference between BertWordPieceTokenizer and BertTokenizer.

BertWordPieceTokenizer (Rust-based)

from tokenizers import BertWordPieceTokenizer

sequence = "Hello, y'all! How are you Tokenizer ? ?"
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt")
tokenized_sequence = tokenizer.encode(sequence)
print(tokenized_sequence)
>>>Encoding(num_tokens=15, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

print(tokenized_sequence.tokens)
>>>['[CLS]', 'hello', ',', 'y', "'", 'all', '!', 'how', 'are', 'you', 'token', '##izer', '[UNK]', '?', '[SEP]']

BertTokenizer

from transformers import BertTokenizer
tokenizer = BertTokenizer("bert-base-cased-vocab.txt")
tokenized_sequence = tokenizer.encode(sequence)
print(tokenized_sequence)
#Output: [19082, 117, 194, 112, 1155, 106, 1293, 1132, 1128, 22559, 17260, 100, 136]
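  1. Why do the two encodings differ? BertWordPieceTokenizer returns an Encoding object, while BertTokenizer returns the vocabulary ids.
  2. What is the fundamental difference between BertWordPieceTokenizer and BertTokenizer? As far as I understand, BertTokenizer also uses WordPiece under the hood.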

Thanks

1 Answer:

Answer 0 (score: 3)

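They should produce the same output when you use the same vocabulary (in your example you used bert-base-uncased-vocab.txt for one tokenizer and bert-base-cased-vocab.txt for the other). The main difference is that the tokenizers from the tokenizers package are faster than those from transformers, because they are implemented in Rust.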
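To make the vocabulary point concrete, here is a minimal pure-Python sketch of greedy longest-match-first WordPiece splitting. It is not the code of either library: the function name wordpiece, the toy vocabularies, and their ids are assumptions made up for illustration. It only aims to show that the pieces and ids you get depend on the vocabulary and on whether the text is lowercased first.

```python
from typing import Dict, List


def wordpiece(word: str, vocab: Dict[str, int], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabularies; the ids are invented for this example.
uncased_vocab = {"[UNK]": 0, "hello": 1, "token": 2, "##izer": 3}
cased_vocab = {"[UNK]": 0, "Hello": 1, "Token": 2, "##izer": 3}

# An uncased setup lowercases the text before looking pieces up.
uncased_tokens = wordpiece("Tokenizer".lower(), uncased_vocab)
# A cased setup keeps the original casing.
cased_tokens = wordpiece("Tokenizer", cased_vocab)
# Lowercased text against a cased vocabulary finds no matching piece.
mismatched_tokens = wordpiece("tokenizer", cased_vocab)

uncased_ids = [uncased_vocab[token] for token in uncased_tokens]

print(uncased_tokens)
print(cased_tokens)
print(mismatched_tokens)
print(uncased_ids)
```

With the real vocabulary files the same effect shows up as different ids for the same sentence, which is what happened in the question.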

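If you modify your example, you will see that they produce the same ids. The tokenizers version also returns other attributes (an Encoding object), whereas the transformers tokenizer returns only a list of ids: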

from tokenizers import BertWordPieceTokenizer

sequence = "Hello, y'all! How are you Tokenizer ? ?"
tokenizerBW = BertWordPieceTokenizer("/content/bert-base-uncased-vocab.txt")
tokenized_sequenceBW = tokenizerBW.encode(sequence)
print(tokenized_sequenceBW)
print(type(tokenized_sequenceBW))
print(tokenized_sequenceBW.ids)

Output:

Encoding(num_tokens=15, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
<class 'Encoding'>
[101, 7592, 1010, 1061, 1005, 2035, 999, 2129, 2024, 2017, 19204, 17629, 100, 1029, 102]

from transformers import BertTokenizer

tokenizerBT = BertTokenizer("/content/bert-base-uncased-vocab.txt")
tokenized_sequenceBT = tokenizerBT.encode(sequence)
print(tokenized_sequenceBT)
print(type(tokenized_sequenceBT))

Output:

[101, 7592, 1010, 1061, 1005, 2035, 999, 2129, 2024, 2017, 19204, 17629, 100, 1029, 102]
<class 'list'>

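You mentioned in the comments that your question is more about why the produced output differs. As far as I can tell, this was a design decision by the developers without a specific reason. It is also not the case that BertWordPieceTokenizer from tokenizers is a drop-in replacement for BertTokenizer from transformers. The transformers library still uses a wrapper to make it compatible with the transformers tokenizer API: there is a BertTokenizerFast class with a "clean-up" method, _convert_encoding, that makes the BertWordPieceTokenizer fully compatible. You therefore have to compare the BertTokenizer example above with the following: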

from transformers import BertTokenizerFast

sequence = "Hello, y'all! How are you Tokenizer ? ?"
tokenizerBW = BertTokenizerFast.from_pretrained("bert-base-uncased")
tokenized_sequenceBW = tokenizerBW.encode(sequence)
print(tokenized_sequenceBW)
print(type(tokenized_sequenceBW))

Output:

[101, 7592, 1010, 1061, 1005, 2035, 999, 2129, 2024, 2017, 19204, 17629, 100, 1029, 102]
<class 'list'>
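The wrapper idea can be pictured with a small, self-contained sketch. Everything in it is hypothetical: ToyEncoding, rich_encode, encode, and the three-word vocabulary are stand-ins invented for illustration, not the real Encoding class or the real _convert_encoding logic. It only shows the shape of the design, where a rich result object is produced first and a thin wrapper reduces it to a plain list of ids.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical toy vocabulary; the ids are invented for this example.
TOY_VOCAB: Dict[str, int] = {"[UNK]": 0, "hello": 1, "world": 2}


@dataclass
class ToyEncoding:
    """Hypothetical stand-in for a rich encoding result (not the real class)."""
    ids: List[int]
    tokens: List[str]
    offsets: List[Tuple[int, int]]


def rich_encode(text: str) -> ToyEncoding:
    """Whitespace-split the lowercased text, recording tokens, ids, and offsets."""
    lowered = text.lower()
    ids: List[int] = []
    tokens: List[str] = []
    offsets: List[Tuple[int, int]] = []
    position = 0
    for word in lowered.split():
        start = lowered.index(word, position)  # character span of this word
        end = start + len(word)
        token = word if word in TOY_VOCAB else "[UNK]"
        tokens.append(token)
        ids.append(TOY_VOCAB[token])
        offsets.append((start, end))
        position = end
    return ToyEncoding(ids=ids, tokens=tokens, offsets=offsets)


def encode(text: str) -> List[int]:
    """Thin wrapper that keeps only the ids, like a list-returning encode()."""
    return rich_encode(text).ids


encoding = rich_encode("Hello brave world")
id_list = encode("Hello brave world")

print(encoding.tokens)
print(encoding.offsets)
print(id_list)
print(type(id_list))
```

The ids are identical in both results; the only difference is how much extra information travels with them.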

From my perspective, the tokenizers library was built independently of the transformers library, with the goal of being fast and useful.