I am a beginner with PyTorch, and I just want to test the pretrained CamemBERT model on a question-answering task with long input text sequences (more than 512 tokens). So I am trying to implement it with the following code:
import torch
import torch.nn as nn
from transformers import CamembertForQuestionAnswering, CamembertTokenizer

class CamemBERTQA(nn.Module):
    def __init__(self, hidden_size, num_labels):
        super(CamemBERTQA, self).__init__()
        self.hidden_size = hidden_size
        self.num_labels = num_labels
        self.camembert = CamembertForQuestionAnswering.from_pretrained('fmikaelian/camembert-base-fquad')
        self.tokenizer = CamembertTokenizer.from_pretrained('fmikaelian/camembert-base-fquad', do_lower_case=True)
        # self.qa_outputs = nn.Linear(self.hidden_size, self.num_labels)

    def forward(self, ids):
        input_ids = ids
        start_scores, end_scores = self.camembert(torch.tensor([input_ids]))
        start_logits = torch.argmax(start_scores)
        end_logits = torch.argmax(end_scores) + 1
        outputs = (start_logits, end_logits,)
        # print(outputs)
        return outputs

    def text_representation(self, text):
        # add_special_tokens takes care of adding the <s>, </s>... tokens in the right way for each model.
        input_ids = torch.tensor([self.tokenizer.encode(text, add_special_tokens=True)])
        with torch.no_grad():
            # Model outputs are tuples; note that since self.camembert has a QA head,
            # element [0] here is the start scores rather than the encoder hidden states.
            last_hidden_states = self.camembert(input_ids)[0]
        # print(last_hidden_states)
        return last_hidden_states

    def tokenize(self, text):
        return self.tokenizer.convert_tokens_to_ids(self.tokenizer.tokenize(text))
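For reference, this is roughly how I call it (the hidden size of 768 for camembert-base, the example text, and the 2 labels for the start/end positions are just my assumptions; with the transformers version I am using, the QA model returns the scores as a tuple):

    model = CamemBERTQA(hidden_size=768, num_labels=2)
    text = "Quelle est la capitale de la France ? La capitale de la France est Paris."
    ids = model.tokenize(text)   # plain token ids, no special tokens added
    start, end = model(ids)      # indices of the predicted answer span in ids
    print(start, end)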
To handle the sequence-length limit, I want to split the input sequence into sub-sequences, generate a text representation for each sub-sequence, and then use a max-pooling layer to produce the final text representation (510 tokens).
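Here is a rough, untested sketch of what I have in mind (long_text_representation and its details are just my own guesses; I load a plain CamembertModel for the representations because the QA model returns start/end scores rather than hidden states):

    import torch
    from transformers import CamembertModel, CamembertTokenizer

    def long_text_representation(text, chunk_size=510):
        tokenizer = CamembertTokenizer.from_pretrained('fmikaelian/camembert-base-fquad')
        encoder = CamembertModel.from_pretrained('camembert-base')
        encoder.eval()

        # Tokenize without special tokens, then cut into chunks of at most 510 ids,
        # which leaves room for <s> and </s> in each chunk (510 + 2 = 512).
        token_ids = tokenizer.encode(text, add_special_tokens=False)
        chunks = [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]

        max_len = chunk_size + 2
        chunk_states = []
        with torch.no_grad():
            for chunk in chunks:
                ids = tokenizer.build_inputs_with_special_tokens(chunk)   # adds <s> ... </s>
                mask = [1] * len(ids) + [0] * (max_len - len(ids))        # ignore padding positions
                ids = ids + [tokenizer.pad_token_id] * (max_len - len(ids))
                hidden = encoder(torch.tensor([ids]),
                                 attention_mask=torch.tensor([mask]))[0]  # (1, max_len, hidden_size)
                chunk_states.append(hidden)

        # Element-wise max pooling over the chunk dimension gives one
        # representation of shape (max_len, hidden_size) for the whole text.
        stacked = torch.cat(chunk_states, dim=0)   # (num_chunks, max_len, hidden_size)
        return stacked.max(dim=0)[0]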
So, could anyone help me adapt this script to achieve that?
Thanks.