Iterating over long text in chunks with the Hugging Face tokenizer

Date: 2020-10-05 22:05:52

Tags: huggingface-tokenizers

Transformer models have a maximum token limit. If I want to split my text into chunks that fit within that limit, what is the usual way to do it?

Because of special-character handling, the tokenizer does not map its tokens onto an object I can simply loop over. Naively,

subst = " ".join(mytext.split(" ")[0:MAX_LEN])

lets me loop over the text in chunks like this:

START = 0
substr = []
while START < len(mytext.split(" ")):
  substr.append(" ".join(mytext.split(" ")[START:START+MAX_LEN]))
  START = START + MAX_LEN
tokens = [tokenizer(chunk) for chunk in substr]

But the number of words in " ".join(mytext.split(" ")[0:MAX_LEN]) is not the same as the number of tokens the tokenizer produces from it.
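One way around the mismatch is to do the chunking in token space rather than word space: encode the document once, slice the id sequence into fixed-size windows (optionally overlapping so no context is lost at chunk boundaries), and feed each window to the model separately. A minimal sketch of the windowing step, independent of any particular tokenizer (the `max_len` and `stride` values are purely illustrative):

```python
def chunk_ids(ids, max_len, stride=0):
    """Slice a token-id sequence into windows of at most max_len ids.

    Consecutive windows overlap by `stride` ids so that context at
    chunk boundaries is shared between neighbouring chunks.
    """
    if stride >= max_len:
        raise ValueError("stride must be smaller than max_len")
    step = max_len - stride
    return [ids[i:i + max_len] for i in range(0, len(ids), step)]

# Toy id sequence standing in for tokenizer(mytext)["input_ids"]
ids = list(range(10))
print(chunk_ids(ids, max_len=4, stride=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9], [9]]
```

Because the slicing happens on the ids the tokenizer actually produced, every chunk is guaranteed to fit the model's limit, which the word-based split cannot guarantee.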

You can see the difference here:

>>> from transformers import LongformerTokenizer
>>> tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')

>>> mytext = "This is a long sentence. " * 2000 # roughly 10k words

>>> len(mytext.split(" "))
10001

>>> encoded_input = tokenizer(mytext) 
Token indices sequence length is longer than the specified maximum sequence length for this model (12003 > 4096). Running this sequence through the model will result in indexing errors

Is there a tokenizer argument for this? If not, what is the generally accepted way of iterating over longer documents?
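There is in fact an argument for this on the fast tokenizers: with `truncation` and a `max_length` set, passing `return_overflowing_tokens=True` makes the encoder return one sequence per chunk instead of silently dropping the overflow, and `stride` controls how many tokens consecutive chunks share. A sketch, assuming a fast tokenizer (which `AutoTokenizer` returns for this checkpoint) and an illustrative `max_length` of 512:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
mytext = "This is a long sentence. " * 2000

enc = tokenizer(
    mytext,
    max_length=512,                   # per-chunk limit, illustrative
    truncation=True,
    return_overflowing_tokens=True,   # keep the remainder as extra chunks
    stride=50,                        # tokens of overlap between chunks
)

# One input_ids list per chunk, each within the limit
print(len(enc["input_ids"]))
print(max(len(ids) for ids in enc["input_ids"]))
```

Each chunk also receives its own special tokens, so the lengths reported here are exactly what the model would see.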

0 Answers:

There are no answers yet.