Testing BERT with long sequences

Date: 2020-04-12 10:32:39

Tags: python nlp pytorch bert-language-model

I am a beginner with PyTorch, and I simply want to test the pretrained CamemBERT model on long input text sequences (more than 512 tokens) for a question-answering task. I am therefore trying to implement it with the following code:

import torch
import torch.nn as nn
from transformers import CamembertForQuestionAnswering, CamembertTokenizer

class CamemBERTQA(nn.Module):
    def __init__(self, hidden_size, num_labels):
        super(CamemBERTQA, self).__init__()
        self.hidden_size = hidden_size
        self.num_labels = num_labels  # store on self so qa_outputs below can use it
        self.camembert = CamembertForQuestionAnswering.from_pretrained('fmikaelian/camembert-base-fquad')
        self.tokenizer = CamembertTokenizer.from_pretrained('fmikaelian/camembert-base-fquad', do_lower_case=True)
#         self.qa_outputs = nn.Linear(self.hidden_size, self.num_labels)

    def forward(self, ids):
        input_ids = ids
        start_scores, end_scores = self.camembert(torch.tensor([input_ids]))
        start_logits = torch.argmax(start_scores)
        end_logits = torch.argmax(end_scores) + 1
        outputs = (start_logits, end_logits,)
        return outputs

    def text_representation(self, text):
        # add_special_tokens takes care of inserting <s>, </s> (CamemBERT's
        # equivalents of [CLS], [SEP]) in the right way for this model.
        input_ids = torch.tensor([self.tokenizer.encode(text, add_special_tokens=True)])
        with torch.no_grad():
            last_hidden_states = self.camembert(input_ids)[0]  # model outputs are tuples
        return last_hidden_states

    def tokenize(self, text):
        return self.tokenizer.convert_tokens_to_ids(self.tokenizer.tokenize(text))

To work around the sequence-length limit, I would like to split the input sequence into sub-sequences, generate a text representation for each sub-sequence, and then produce the final text representation (510 tokens) with a max-pooling layer.
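A minimal sketch of that chunk-and-pool idea in plain PyTorch (the window size of 510, the mean over tokens within a chunk, the helper names, and the toy embedding standing in for the CamemBERT encoder are all my assumptions, not from the post):

```python
import torch

def chunk_ids(input_ids, max_len=510):
    # Split a flat list of token ids into sub-sequences of at most max_len tokens.
    return [input_ids[i:i + max_len] for i in range(0, len(input_ids), max_len)]

def pooled_representation(input_ids, encoder, max_len=510):
    # Encode each chunk, average over its tokens to get one vector per chunk,
    # then max-pool element-wise across the chunk vectors.
    chunk_reprs = []
    for chunk in chunk_ids(input_ids, max_len):
        with torch.no_grad():
            hidden = encoder(torch.tensor([chunk]))  # (1, chunk_len, hidden_size)
        chunk_reprs.append(hidden.mean(dim=1))       # (1, hidden_size)
    stacked = torch.cat(chunk_reprs, dim=0)          # (num_chunks, hidden_size)
    return stacked.max(dim=0).values                 # (hidden_size,)

# Toy stand-in for the CamemBERT encoder: maps ids to "hidden states".
emb = torch.nn.Embedding(2000, 16)
encoder = lambda ids: emb(ids)

# 1200 tokens -> three chunks (510, 510, 180) -> one 16-dim vector.
vec = pooled_representation(list(range(1200)), encoder)
```

With the real model, `encoder` would be replaced by a call that returns the last hidden states (e.g. a base `CamembertModel` rather than the QA head, since the QA model's first outputs are start/end logits).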

So, could anyone help me adapt this script to achieve that?

Thanks

0 Answers:

There are no answers yet.