Tokens-to-words mapping in the tokenizer decode step

Posted: 2020-06-11 05:33:25

Tags: pytorch tokenize huggingface-transformers

Is there a way, within the tokenizer.decode() function, to know the mapping from the tokens back to the original words?
For example:

from transformers.tokenization_roberta import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-large', do_lower_case=True)

example = "This is a tokenization example"
tokenized = tokenizer.tokenize(example)
## ['this', 'Ġis', 'Ġa', 'Ġtoken', 'ization', 'Ġexample']

encoded = tokenizer.encode_plus(example)
## encoded['input_ids'] = [0, 42, 16, 10, 19233, 1938, 1246, 2]

decoded = tokenizer.decode(encoded['input_ids'])
## '<s> this is a tokenization example</s>'

The goal is to have a function that maps each token in the decode process back to the correct input word; here that would be:

desired_output = [[1], [2], [3], [4, 5], [6]]

because This corresponds to id 42, while tokenization corresponds to ids [19233, 1938], which sit at indexes 4 and 5 of the input_ids array.
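For reference, the grouping asked for here can also be derived directly from the tokenize() output, since RoBERTa's BPE tokenizer marks word boundaries with a "Ġ" prefix. A minimal sketch with no transformers dependency; the helper name tokens_to_word_indexes is made up for illustration:

```python
def tokens_to_word_indexes(tokens, first_idx=1):
    """Group token positions into per-word lists using the 'Ġ' boundary marker.

    first_idx defaults to 1 to skip the leading <s> special token, so the
    positions line up with encode_plus() input_ids.
    """
    groups = []
    for offset, tok in enumerate(tokens):
        idx = first_idx + offset
        if tok.startswith("Ġ") or not groups:
            groups.append([idx])       # a "Ġ" prefix starts a new word
        else:
            groups[-1].append(idx)     # continuation piece of the previous word
    return groups

tokens = ['this', 'Ġis', 'Ġa', 'Ġtoken', 'ization', 'Ġexample']
print(tokens_to_word_indexes(tokens))
# [[1], [2], [3], [4, 5], [6]]
```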

1 Answer:

Answer 0: (score: 2)

As far as I know there is no built-in method for that, but you can create one yourself:

from transformers.tokenization_roberta import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-large', do_lower_case=True)

example = "This is a tokenization example"

print({x : tokenizer.encode(x, add_special_tokens=False, add_prefix_space=True) for x in example.split()})

Output:

{'This': [42], 'is': [16], 'a': [10], 'tokenization': [19233, 1938], 'example': [1246]}

To get exactly the desired output, combine a list comprehension over the per-word encodings with a loop that assigns running indexes:

# Start at index 1 because the number of special tokens is fixed for each model
# (but be aware that single-sentence and sentence-pair inputs add different special tokens)
idx = 1

enc = [tokenizer.encode(x, add_special_tokens=False, add_prefix_space=True) for x in example.split()]

desired_output = []

for token in enc:
    tokenoutput = []
    for ids in token:
        tokenoutput.append(idx)
        idx += 1
    desired_output.append(tokenoutput)

print(desired_output)

Output:

[[1], [2], [3], [4, 5], [6]]
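Newer versions of the library also ship "fast" tokenizers (e.g. RobertaTokenizerFast), whose encoding result exposes a word_ids() method returning one word index per input id (None for special tokens). Turning that into the nested desired_output shape is plain list manipulation; in the sketch below the word_ids value is hard-coded as an assumed example rather than produced by a live tokenizer call:

```python
def group_by_word(word_ids):
    """Convert a flat word_ids() list into per-word lists of token positions."""
    groups = {}
    for pos, wid in enumerate(word_ids):
        if wid is None:                       # skip special tokens like <s> and </s>
            continue
        groups.setdefault(wid, []).append(pos)
    return [groups[w] for w in sorted(groups)]

# Assumed word_ids() output for "This is a tokenization example"
word_ids = [None, 0, 1, 2, 3, 3, 4, None]
print(group_by_word(word_ids))
# [[1], [2], [3], [4, 5], [6]]
```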