Transformers and BERT: dealing with possessives and apostrophes when encoding

Time: 2020-04-02 16:18:46

Tags: python nlp huggingface-transformers

Let's consider these two sentences:

"why isn't Alex's text tokenizing? The house on the left is the Smiths' house"

Now let's tokenize and decode:

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tokenizer.decode(tokenizer.convert_tokens_to_ids(tokenizer.tokenize("why isn't Alex's text tokenizing? The house on the left is the Smiths' house")))

We get:

"why isn't alex's text tokenizing? the house on the left is the smiths'house"

My question is: how do I deal with the missing space in possessives such as smiths'house?
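The space seems to disappear during decoding's punctuation clean-up step rather than anywhere else. A minimal sketch of that behavior (the replacement rules below are an assumption modeled on the `clean_up_tokenization` helper in transformers, not its full implementation):

```python
# Minimal reproduction of how decode() can swallow the space after a trailing
# possessive apostrophe. Tokens are joined with spaces, "##" continuation
# markers are removed, and then punctuation fix-ups are applied.
def cleanup(out_string: str) -> str:
    # Assumed subset of the clean-up rules applied during decoding.
    return (
        out_string.replace(" ' ", "'")   # collapses BOTH surrounding spaces
        .replace(" n't", "n't")
        .replace(" 's", "'s")
    )

joined = "the smith ##s ' house".replace(" ##", "")
print(joined)           # the smiths ' house
print(cleanup(joined))  # the smiths'house
```

The rule that rejoins a contraction like `isn ' t` into `isn't` is the same one that, applied to a standalone possessive apostrophe, glues `smiths '` onto the following word.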

To me, it seems that the tokenization in Transformers is working incorrectly. Let's consider the output of

tokenizer.tokenize("why isn't Alex's text tokenizing? The house on the left is the Smiths' house")

We get:

['why', 'isn', "'", 't', 'alex', "'", 's', 'text', 'token', '##izing', '?', 'the', 'house', 'on', 'the', 'left', 'is', 'the', 'smith', '##s', "'", 'house']

So at this step we have already lost important information about the last apostrophe. It would be better if the tokenization were done another way:

['why', 'isn', "##'", '##t', 'alex', "##'", '##s', 'text', 'token', '##izing', '?', 'the', 'house', 'on', 'the', 'left', 'is', 'the', 'smith', '##s', "##'", 'house']

This way, the tokenization keeps all the information about apostrophes, and there would be no problem with possessives.
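As a practical workaround (a sketch, not a library fix): fast tokenizers can return character offsets (`return_offsets_mapping=True` on e.g. `BertTokenizerFast`), so the original string with its exact spacing never has to be rebuilt from tokens at all. The `(start, end)` spans below are hand-written in the shape such an offset mapping takes:

```python
# Sketch: recover text from character offsets instead of decoding tokens.
# The offsets here are illustrative examples of what a fast tokenizer's
# return_offsets_mapping=True would yield for these tokens.
def detokenize(text: str, offsets) -> str:
    out, prev_end = [], 0
    for start, end in offsets:
        if start > prev_end:
            out.append(text[prev_end:start])  # keep the original whitespace
        out.append(text[start:end])
        prev_end = end
    return "".join(out)

text = "the Smiths' house"
# spans for the tokens: the, smith, ##s, ', house
offsets = [(0, 3), (4, 9), (9, 10), (10, 11), (12, 17)]
print(detokenize(text, offsets))  # the Smiths' house
```

Because each token carries its position in the original string, the apostrophe-before-space case round-trips losslessly, with no guessing during decode.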

0 Answers:

There are no answers