Question

I have a tokenizer (BertTokenizer) and a model (BertModel) from the transformers library. I pretrained the model from scratch on a handful of Wikipedia articles, just to test how it works.

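Once the model is pretrained, I want to extract a sentence-level vector representation of a given sentence. To do that, I take the mean of the 11 hidden (768-dimensional) vectors. I do it as follows (line is a single String):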

padded_sequence = tokenizer(line, padding=True)
        
indexed_tokens = padded_sequence['input_ids']
attention_mask = padded_sequence["attention_mask"]

tokens_tensor = torch.tensor([indexed_tokens])
attention_mask_tensor = torch.tensor([attention_mask])

outputs = model(tokens_tensor, attention_mask_tensor)
hidden_states = outputs[0]

line_vectorized = hidden_states[0].data.numpy().mean(axis=0)

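So far so good. I can do this for each sentence individually. But now I want to do it in batches: I have a bunch of sentences, and instead of iterating over them one by one, I want to send the appropriate tensor representation and get all the vectors at once. I do it as follows (lines is a list of Strings):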

padded_sequences = self.tokenizer_PYTORCH(lines, padding=True)
        
indexed_tokens_list = padded_sequences['input_ids']
attention_mask_list = padded_sequences["attention_mask"]
        
tokens_tensors_list = [torch.tensor([indexed_tokens]) for indexed_tokens in indexed_tokens_list]
attention_mask_tensors_list = [torch.tensor([attention_mask ]) for attention_mask in attention_mask_list ]
        
tokens_tensors = torch.cat((tokens_tensors_list), 0)
attention_mask_tensors = torch.cat((attention_mask_tensors_list ), 0)

outputs = model(tokens_tensors, attention_mask_tensors)
hidden_states = outputs[0]

lines_vectorized = [hidden_states[i].data.numpy().mean(axis=0) for i in range(0, len(hidden_states))]
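
As a side note, the per-sentence wrapping and torch.cat steps above can probably be collapsed into one call, because the padded lists all have the same length. A minimal sketch on toy token ids (the ids below are made up for illustration and stand in for padded_sequences['input_ids']):

```python
import torch

# Toy padded ids standing in for padded_sequences['input_ids'].
# Assumption: every row has the same length because padding=True was used.
indexed_tokens_list = [
    [10002, 3101, 10000, 10004],
    [10002, 2217, 6496, 10000],
]

# The approach from the question: wrap each row, then concatenate along dim 0.
via_cat = torch.cat([torch.tensor([ids]) for ids in indexed_tokens_list], 0)

# Building the tensor directly from the nested list gives the same result.
direct = torch.tensor(indexed_tokens_list)

print(torch.equal(via_cat, direct))
```

The same applies to the attention mask list.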

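The problem is the following: I have to use padding so that the token tensors can be concatenated properly. This means that the indexed tokens and the attention mask can be longer than they were when the sentences were evaluated individually. But when I use padding, I get different results for the padded sentences.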

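Example: I have two sentences (originally in French, given here in translation; the language does not matter):

sentence_A = "clothing excerpted from an article of the free encyclopedia"

sentence_B = "navigate to articles about biology"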

When I evaluate the two sentences individually, I get:

sentence_A:

indexed_tokens =  [10002, 3101, 4910, 557, 73, 3215, 9630, 2343, 4200, 8363, 10000]
attention_mask =  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
line_vectorized =  [-0.9304411   0.53798294 -1.6231083 ...]

sentence_B:

indexed_tokens =  [10002, 2217, 6496, 1387, 9876, 2217, 6496, 1387, 4441, 405, 73, 6451, 3, 2190, 5402, 1387, 2971, 10000]
attention_mask =  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
line_vectorized =  [-0.8077076   0.56028104 -1.5135447  ...]

But when I evaluate the two sentences as a batch, I get:

sentence_A:

indexed_tokens =  [10002, 3101, 4910, 557, 73, 3215, 9630, 2343, 4200, 8363, 10000, 10004, 10004, 10004, 10004, 10004, 10004, 10004]
attention_mask =  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
line_vectorized =  [-1.0473819   0.6090186  -1.727466  ...]

sentence_B:

indexed_tokens =  [10002, 2217, 6496, 1387, 9876, 2217, 6496, 1387, 4441, 405, 73, 6451, 3, 2190, 5402, 1387, 2971, 10000]
attention_mask =  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
line_vectorized =  [-0.8077076   0.56028104 -1.5135447  ...]

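That is, because sentence_B is longer than sentence_A, sentence_A gets padded, and its attention mask is padded with zeros as well. The indexed tokens now contain extra tokens (10004, which I assume is the empty token). The vector representation of sentence_B is unchanged, but the vector representation of sentence_A has changed.

I would like to know whether this is working as intended (I assume it is not). I guess I am doing something wrong, but I cannot figure out what.

Any ideas?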

Answer 1

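When you process a single sentence per batch, the maximum sequence length is simply the number of tokens in that sentence. When you process sentences in a batch, the maximum length is the same for every sentence in the batch, and by default it is the number of tokens in the longest sentence. In the attention mask, a value of 1 marks a position that is not a <PAD> token, and 0 marks a <PAD> token. The best way to control this is to define a maximum sequence length and truncate sentences that are longer than it.

This can be done with an alternative method for tokenizing text in batches (a single sentence can be treated as a batch of size 1):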

tokenizer = BertTokenizer.from_pretrained("<your bert model>", do_lower_case=True)
encoding = tokenizer.batch_encode_plus(lines, return_tensors='pt', padding=True, truncation=True, max_length=50, add_special_tokens=True)  # Change max_length to the required maximum length
indexed_tokens = encoding['input_ids']
attention_mask = encoding['attention_mask']
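
One likely reason the padded sentence's vector changes is that a plain mean over the sequence axis also averages the hidden states at the padded positions. A minimal sketch of mask-aware averaging follows, assuming the hidden states have shape (batch, seq_len, hidden_size). The helper name masked_mean is hypothetical, and the toy tensor only stands in for outputs[0]; it has not been checked against the model from the question.

```python
import torch

def masked_mean(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average hidden states over real tokens only (mask == 1). Hypothetical helper."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden size.
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
    summed = (hidden_states * mask).sum(dim=1)   # padded positions contribute zero
    counts = mask.sum(dim=1).clamp(min=1.0)      # number of real tokens, guarded against 0
    return summed / counts

# Toy stand-in for outputs[0]: batch of 2, 4 positions, hidden size 3.
# The first row has two real tokens followed by two padded positions.
hidden_states = torch.tensor([
    [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0], [9.0, 9.0, 9.0]],
    [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0], [4.0, 4.0, 4.0]],
])
attention_mask = torch.tensor([[1, 1, 0, 0], [1, 1, 1, 1]])

pooled = masked_mean(hidden_states, attention_mask)  # ignores padded positions
plain = hidden_states.mean(dim=1)                    # includes padded positions

print(pooled)
print(plain)
```

With the real model this would be called as masked_mean(outputs[0], attention_mask). For the unpadded row the two averages agree; for the padded row only the masked version leaves out the padded positions. Small floating-point differences between batched and single-sentence runs may still remain.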
