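I am following a paper on BERT-based lexical substitution (specifically, I am trying to implement equation (2); if someone has already implemented the whole paper, that would be great too). For this I want to obtain both the last hidden layers (the only thing I am unsure about is the ordering of the layers in the output: last first or first first?) and the attention from a basic BERT model (bert-base-uncased).
However, I am not sure whether the huggingface/transformers library actually outputs the attention for bert-base-uncased (I am using torch, but am open to using TF instead).
From what I had read, I expected to get a tuple of (logits, hidden_states, attentions), but with the example below (which runs, e.g., in Google Colab) I get a tuple of length 2 instead.
Am I misinterpreting what I am getting, or going about this the wrong way? I did the obvious test and used output_attention=False instead of output_attention=True (while output_hidden_states=True does indeed add the hidden states, as expected), and nothing changed in the output I got. That is clearly a bad sign about my understanding of the library, or it indicates an issue.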
import numpy as np
import torch
!pip install transformers
from transformers import (AutoModelWithLMHead,
                          AutoTokenizer,
                          BertConfig)
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
config = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True, output_attention=True)  # Nothing changes when I switch to output_attention=False
bert_model = AutoModelWithLMHead.from_config(config)
sequence = "We went to an ice cream cafe and had a chocolate ice cream."
bert_tokenized_sequence = bert_tokenizer.tokenize(sequence)
indexed_tokens = bert_tokenizer.encode(bert_tokenized_sequence, return_tensors='pt')
predictions = bert_model(indexed_tokens)
########## Now let's have a look at what the predictions look like #############
print(len(predictions)) # Length is 2, I expected 3: logits, hidden_layers, attention
print(predictions[0].shape)  # torch.Size([1, 16, 30522]) - seems to be the logits (shape is 1 x sequence length x vocabulary size)
print(len(predictions[1]))  # Length is 13 - the hidden layers?! There are meant to be 12, right? Is one somehow the attention?
for k in range(len(predictions[1])):
    print(predictions[1][k].shape)  # These all seem to be torch.Size([1, 16, 768]), so presumably the hidden layers?
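Edit: the following version, which loads BertModel with from_pretrained and sets output_attentions (plural), returns the attentions as well:
import numpy as np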
import torch
!pip install transformers
from transformers import BertModel, BertConfig, BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
config = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True, output_attentions=True)
model = BertModel.from_pretrained('bert-base-uncased', config=config)
sequence = "We went to an ice cream cafe and had a chocolate ice cream."
tokenized_sequence = tokenizer.tokenize(sequence)
indexed_tokens = tokenizer.encode(tokenized_sequence, return_tensors='pt')
outputs = model(indexed_tokens)
print(len(outputs))  # 4
print(outputs[0].shape)  # (1, 16, 768): last hidden state
print(outputs[1].shape)  # (1, 768): pooled output
print(len(outputs[2]))  # 13 = embedding output (index 0) + 12 hidden layers (indices 1 to 12)
print(outputs[2][0].shape)  # each of these 13 is (1, 16, 768) = batch, token position, hidden size
print(len(outputs[3]))  # 12 (= attention for each layer)
print(outputs[3][0].shape)  # index 0 = first layer; (1, 12, 16, 16) = batch, attention head, token position, token position
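For readers unsure what each entry of the attention tuple holds: per head, it is a matrix of softmax-normalized weights with one row per token position, so every row sums to 1. Below is a minimal plain-Python sketch of how one such matrix is formed. It is an illustration only, not the library's implementation: it ignores the learned query/key projections, masking, and dropout, and the function name and toy vectors are made up.

```python
import math

def attention_weights(queries, keys):
    """Return softmax(Q K^T / sqrt(d)) as a list of rows, one row per query position."""
    d = len(queries[0])
    weights = []
    for q in queries:
        # Scaled dot product of this query with every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy input (assumed values): 3 token positions, vector size 2.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_weights(vectors, vectors)
```

With the real model output above, `outputs[3][layer][0][head]` would be a matrix of this kind, of size 16 x 16 for the example sentence.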
Answer 0 (score: 2)
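The reason is that you are using AutoModelWithLMHead, which is a wrapper around the actual model. It calls the BERT model (i.e., an instance of BertModel) and then uses the embedding matrix as the weight matrix for word prediction. In between, the underlying model does return the attentions, but the wrapper does not care and returns only the logits.
You can get the BERT model directly by calling AutoModel. Note that this model does not return the logits, but the hidden states.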
bert_model = AutoModel.from_config(config)
Alternatively, you can get it from the BertWithLMHead object by calling:
wrapped_model = bert_model.base_model
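To make "uses the embedding matrix as the weight matrix for word prediction" concrete, here is a toy sketch in plain Python (no transformers involved). It only shows the weight-tying idea: each position's hidden vector is scored against every row of the embedding matrix. The function name, sizes, and values are assumptions for illustration; the real BERT LM head also applies an extra transform and a bias before this projection.

```python
def tied_logits(hidden, embedding):
    """Score every vocabulary entry by the dot product of a hidden vector with its embedding row."""
    return [
        [sum(h * e for h, e in zip(h_vec, e_vec)) for e_vec in embedding]
        for h_vec in hidden
    ]

# Toy sizes (assumed): vocabulary of 3 entries, hidden size 2, sequence of 2 positions.
embedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [[2.0, 0.5], [-1.0, 3.0]]

# Result has one row per position and one column per vocabulary entry.
logits = tied_logits(hidden, embedding)
```

The result has shape sequence length x vocabulary size, which matches the 1 x 16 x 30522 logits tensor seen in the question once a batch dimension is added.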
Answer 1 (score: 1)
I think it is too late to answer this question, but with the updates to Hugging Face's transformers, I think we can use this:
config = BertConfig.from_pretrained('bert-base-uncased',
                                    output_hidden_states=True, output_attentions=True)
bert_model = BertModel.from_pretrained('bert-base-uncased',
                                       config=config)
# input_ids: tensor of token ids, e.g. from tokenizer.encode(..., return_tensors='pt')
with torch.no_grad():
    out = bert_model(input_ids)
last_hidden_states = out.last_hidden_state
pooler_output = out.pooler_output
hidden_states = out.hidden_states
attentions = out.attentions
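On the ordering question from the original post: as the comments in the question's edit note, the first entry of the hidden-states tuple is the embedding output and the following entries are the encoder layers in order, so the last entry is the last layer. A small sketch of that indexing convention, using plain strings as stand-ins for tensors (the helper name is hypothetical and not part of transformers):

```python
def select_layer(hidden_states, layer):
    """Return the output of encoder layer `layer` (1-based); 0 selects the embedding output."""
    num_layers = len(hidden_states) - 1  # the first entry is the embedding output
    if not 0 <= layer <= num_layers:
        raise IndexError(f"layer must be in 0..{num_layers}, got {layer}")
    return hidden_states[layer]

# Stand-in for the 13-entry tuple reported for bert-base-uncased.
hidden_states = ("embeddings",) + tuple(f"layer_{i}" for i in range(1, 13))
last_layer = select_layer(hidden_states, 12)
```

With the real BertModel output, `hidden_states[-1]` should hold the same values as `last_hidden_state`.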