Tokens returned by encode() in the Transformers BERT model

Date: 2020-08-27 01:38:31

Tags: python machine-learning nlp bert-language-model huggingface-transformers

I have a small sentiment analysis dataset. The classifier will be a simple KNN, but I want the word embeddings to come from a BERT model in the transformers library. Note that I have only just discovered this library, so I'm still learning.

So, working from online examples, I'm trying to understand the dimensions of what the model returns.

Example:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# a list of two strings
tokens = tokenizer.encode(["Hello, my dog is cute", "He is really nice"])
print(tokens)

# two separate string arguments
tokens = tokenizer.encode("Hello, my dog is cute", "He is really nice")
print(tokens)

# a list containing one string
tokens = tokenizer.encode(["Hello, my dog is cute"])
print(tokens)

# one plain string
tokens = tokenizer.encode("Hello, my dog is cute")
print(tokens)

The output is as follows:

[101, 100, 100, 102]

[101, 7592, 1010, 2026, 3899, 2003, 10140, 102, 2002, 2003, 2428, 3835, 102]

[101, 100, 102]

[101, 7592, 1010, 2026, 3899, 2003, 10140, 102]

I can't seem to find documentation for encode(), and I don't understand why it returns something different when the input is passed as a list. What is it doing?

Also, is there a way to pass the token IDs back in and get the actual words out, to make sense of the above?

Thanks in advance.

1 answer:

Answer 0 (score: 0)

You can call tokenizer.convert_ids_to_tokens() to get the actual tokens back from the IDs:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

tokens = []

tokens.append(tokenizer.encode(["Hello, my dog is cute", "He is really nice"]))

tokens.append(tokenizer.encode("Hello, my dog is cute", "He is really nice"))

tokens.append(tokenizer.encode(["Hello, my dog is cute"]))

tokens.append(tokenizer.encode("Hello, my dog is cute"))

# map each list of ids back to readable token strings
for t in tokens:
    print(tokenizer.convert_ids_to_tokens(t))

Output:

['[CLS]', '[UNK]', '[UNK]', '[SEP]']
['[CLS]', 'hello', ',', 'my', 'dog', 'is', 'cute', '[SEP]', 'he', 'is', 'really', 'nice', '[SEP]']
['[CLS]', '[UNK]', '[SEP]']
['[CLS]', 'hello', ',', 'my', 'dog', 'is', 'cute', '[SEP]']
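
If you want a plain string back instead of the WordPiece tokens, tokenizer.decode() does the reverse of encode(). A minimal sketch, reusing the IDs from the single-sentence example above:

ids = [101, 7592, 1010, 2026, 3899, 2003, 10140, 102]

# decode() joins the ids back into text; skip_special_tokens drops [CLS]/[SEP]
print(tokenizer.decode(ids, skip_special_tokens=True))
# hello, my dog is cute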

As you can see here, each input was tokenized, and the special tokens appropriate for your model (BERT) were added. Note what happened with the lists: when encode() receives a list of strings, it treats each element as a single, already-tokenized token, and since whole sentences like "Hello, my dog is cute" are not in the vocabulary, they get mapped to [UNK] (id 100). Whether that counts as a bug or intended behaviour depends on how you look at it, because the method meant for batches of sentences is batch_encode_plus:

tokenizer.batch_encode_plus(["Hello, my dog is cute", "He is really nice"], return_token_type_ids=False, return_attention_mask=False)

Output:

{'input_ids': [[101, 7592, 1010, 2026, 3899, 2003, 10140, 102], [101, 2002, 2003, 2428, 3835, 102]]}
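
To actually feed a model you will usually also want padding and tensors. A short sketch, assuming your transformers version supports the padding and return_tensors arguments:

batch = tokenizer.batch_encode_plus(
    ["Hello, my dog is cute", "He is really nice"],
    padding=True,          # pad the shorter sentence up to the longer one
    return_tensors="pt",   # return PyTorch tensors instead of Python lists
)
print(batch["input_ids"].shape)  # torch.Size([2, 8])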

I'm not sure why the encode method is undocumented, but it may be that Hugging Face wants us to call the tokenizer directly (its __call__ method):

tokens = []

tokens.append(tokenizer(["Hello, my dog is cute", "He is really nice"],  return_token_type_ids=False, return_attention_mask=False))

tokens.append(tokenizer("Hello, my dog is cute", "He is really nice",  return_token_type_ids=False, return_attention_mask=False))

tokens.append(tokenizer(["Hello, my dog is cute"], return_token_type_ids=False, return_attention_mask=False))

tokens.append(tokenizer("Hello, my dog is cute", return_token_type_ids=False, return_attention_mask=False))

print(tokens)

Output:

[{'input_ids': [[101, 7592, 1010, 2026, 3899, 2003, 10140, 102], [101, 2002, 2003, 2428, 3835, 102]]}, {'input_ids': [101, 7592, 1010, 2026, 3899, 2003, 10140, 102, 2002, 2003, 2428, 3835, 102]}, {'input_ids': [[101, 7592, 1010, 2026, 3899, 2003, 10140, 102]]}, {'input_ids': [101, 7592, 1010, 2026, 3899, 2003, 10140, 102]}]
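
Coming back to the original goal (embeddings for a KNN classifier): the tokenizer only produces ids, so you still have to run them through BertModel to get the actual embeddings. A minimal sketch, assuming PyTorch and that a simple mean over the last hidden state is acceptable as a sentence vector (a padding-aware mean would be slightly more accurate):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

inputs = tokenizer(["Hello, my dog is cute", "He is really nice"],
                   padding=True, return_tensors="pt")

with torch.no_grad():
    # the first element of the output is the last hidden state:
    # one 768-dim vector per token, shape (batch, seq_len, 768)
    last_hidden_state = model(**inputs)[0]

# mean-pool over the token dimension: one vector per sentence for the KNN
sentence_vectors = last_hidden_state.mean(dim=1)
print(sentence_vectors.shape)  # torch.Size([2, 768])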