Iterating over long text in chunks with the Hugging Face tokenizer

Date: 2020-10-05 22:05:52

Tags: huggingface-tokenizers

Transformer models have a maximum token limit. If I want to split my text into chunks that fit within that limit, what is the usual way to do it?

Because of special-character handling, the tokenizer does not map its tokens onto an object I can simply loop over. Naively,

subst = " ".join(mytext.split(" ")[0:MAX_LEN])

lets me loop over the text in chunks like this:

START = 0
substr = []
while START < len(mytext.split(" ")):
  substr.append(" ".join(mytext.split(" ")[START:START+MAX_LEN]))
  START = START + MAX_LEN
tokens = [tokenizer(chunk) for chunk in substr]

But the number of words in " ".join(mytext.split(" ")[0:MAX_LEN]) is not the same as the number of tokens the tokenizer produces from it.
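One way around the mismatch is to do the chunking in token space rather than word space: encode the document once, slice the id sequence into fixed-size windows (optionally overlapping so no context is lost at chunk boundaries), and feed each window to the model separately. A minimal sketch of the windowing step, independent of any particular tokenizer (the `max_len` and `stride` values are purely illustrative):

```python
def chunk_ids(ids, max_len, stride=0):
    """Slice a token-id sequence into windows of at most max_len ids.

    Consecutive windows overlap by `stride` ids so that context at
    chunk boundaries is shared between neighbouring chunks.
    """
    if stride >= max_len:
        raise ValueError("stride must be smaller than max_len")
    step = max_len - stride
    return [ids[i:i + max_len] for i in range(0, len(ids), step)]

# Toy id sequence standing in for tokenizer(mytext)["input_ids"]
ids = list(range(10))
print(chunk_ids(ids, max_len=4, stride=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9], [9]]
```

Because the slicing happens on the ids the tokenizer actually produced, every chunk is guaranteed to fit the model's limit, which the word-based split cannot guarantee.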

You can see the difference here:

>>> from transformers import LongformerTokenizer
>>> tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')

>>> mytext = "This is a long sentence. " * 2000 # roughly 10k words

>>> len(mytext.split(" "))
10001

>>> encoded_input = tokenizer(mytext) 
Token indices sequence length is longer than the specified maximum sequence length for this model (12003 > 4096). Running this sequence through the model will result in indexing errors

Is there a tokenizer argument for this? If not, what is the generally accepted way of iterating over longer documents?
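There is in fact an argument for this on the fast tokenizers: with `truncation` and a `max_length` set, passing `return_overflowing_tokens=True` makes the encoder return one sequence per chunk instead of silently dropping the overflow, and `stride` controls how many tokens consecutive chunks share. A sketch, assuming a fast tokenizer (which `AutoTokenizer` returns for this checkpoint) and an illustrative `max_length` of 512:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
mytext = "This is a long sentence. " * 2000

enc = tokenizer(
    mytext,
    max_length=512,                   # per-chunk limit, illustrative
    truncation=True,
    return_overflowing_tokens=True,   # keep the remainder as extra chunks
    stride=50,                        # tokens of overlap between chunks
)

# One input_ids list per chunk, each within the limit
print(len(enc["input_ids"]))
print(max(len(ids) for ids in enc["input_ids"]))
```

Each chunk also receives its own special tokens, so the lengths reported here are exactly what the model would see.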

0 Answers:

There are no answers yet.