Transformer models have a maximum token limit. If I want to split my text into chunks that fit within that limit, what is the usual approach?
Because of how special characters are handled, the tokenizer does not map its tokens back to an object I can simply loop over. The naive approach:
subst = " ".join(mytext.split(" ")[0:MAX_LEN])
lets me loop over the text in chunks like this:
START = 0
substr = []
words = mytext.split(" ")
while START < len(words):
    # append instead of indexed assignment, and keep the final partial chunk
    substr.append(" ".join(words[START:START + MAX_LEN]))
    START += MAX_LEN
tokens = [tokenizer(s) for s in substr]
但是," ".join(mytext.split(" ")[0:MAX_LEN])
不等于tokenizer(text)
给出的长度。
You can see the difference here:
>>> from transformers import LongformerTokenizer
>>> tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
>>> mytext = "This is a long sentence. " * 2000  # roughly 10k whitespace-delimited words
>>> len(mytext.split(" "))
10001
>>> encoded_input = tokenizer(mytext)
Token indices sequence length is longer than the specified maximum sequence length for this model (12003 > 4096). Running this sequence through the model will result in indexing errors
Is there a function argument to tokenizer that handles this? If no such argument is available, what is the generally accepted procedure for iterating over longer documents?
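For reference, the kind of chunking I have in mind can be sketched tokenizer-agnostically: encode the whole text once, then slice the resulting token IDs into (optionally overlapping) windows of at most MAX_LEN, so the limit is enforced on actual tokens rather than on whitespace-split words. The `chunk_tokens` helper below is my own placeholder, not a real Hugging Face API, and the list of integers stands in for `tokenizer(mytext)["input_ids"]`:

```python
def chunk_tokens(token_ids, max_len, stride=0):
    """Slice a token-id sequence into windows of at most max_len tokens,
    overlapping consecutive windows by `stride` tokens for context."""
    if stride >= max_len:
        raise ValueError("stride must be smaller than max_len")
    step = max_len - stride
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), step)]

# stand-in for tokenizer(mytext)["input_ids"]
ids = list(range(10))
for window in chunk_tokens(ids, max_len=4, stride=1):
    print(window)
```

Each window is then short enough to decode (or feed) on its own; with `stride=0` the windows are disjoint, and the last window may be shorter than max_len.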