Using BERT for word embeddings and mapping tensors to words

Time: 2020-08-04 09:49:46

Tags: neural-network pytorch bert-language-model

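I am trying to aggregate BERT embeddings at the token level. For each token in my corpus vocabulary, I want to collect all of its contextual embeddings in a list and average them, so that every token in the vocabulary ends up with a single representation.

The code is pasted below.

Question: how can I map the output tensor (see the object token_vecs_sum in the last lines of the code below) to specific tokens?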

Preprocessing the data

!pip install transformers
import torch
from transformers import BertTokenizer
from nltk import tokenize
import nltk
nltk.download('punkt')
import re

MAX_LEN = 64

sentences = ['Some sentences. Some sentences are.', 'Some sentences are really.', 'Some sentences are really hard.']

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

def preprocessing_for_bert(data):
  input_ids = []
  attention_masks = []
  for row in data:
    sents = tokenize.sent_tokenize(row)
    print(sents)
    for sent in sents:
      encoded_sent = tokenizer.encode_plus(text=sent,
                                          add_special_tokens=True,
                                          max_length=MAX_LEN,
                                          padding='max_length',  # replaces the deprecated pad_to_max_length=True
                                          return_attention_mask=True,
                                          truncation=True)
      input_ids.append(encoded_sent.get('input_ids'))
      attention_masks.append(encoded_sent.get('attention_mask'))
  # Convert lists to tensors
  input_ids = torch.tensor(input_ids)
  attention_masks = torch.tensor(attention_masks)

  return input_ids, attention_masks


# Note: the second return value holds the attention masks, despite its name.
tokens_tensor, segments_tensor = preprocessing_for_bert(sentences)

Loading the pre-trained model

import torch
from transformers import BertTokenizer, BertModel

# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states = True)

# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()

Running the text through BERT

# Run the text through BERT, and collect all of the hidden states produced
# from all 12 layers. 
with torch.no_grad():

    # The second tensor holds the attention masks, so pass it by keyword.
    outputs = model(tokens_tensor, attention_mask=segments_tensor)

    # Evaluating the model will return a different number of objects based on
    # how it's configured in the `from_pretrained` call earlier. In this case,
    # because we set `output_hidden_states = True`, the third item will be the
    # hidden states from all layers.
    hidden_states = outputs[2]

Output

print ("Number of layers:", len(hidden_states), "  (initial embeddings + 12 BERT layers)")

layer_i = 0
print ("Number of batches:", len(hidden_states[layer_i]))

batch_i = 0
print ("Number of tokens:", len(hidden_states[layer_i][batch_i]))

token_i = 0
print ("Number of hidden units:", len(hidden_states[layer_i][batch_i][token_i]))

Number of layers: 13   (initial embeddings + 12 BERT layers)
Number of batches: 4
Number of tokens: 64
Number of hidden units: 768

Aggregation

token_embeddings = torch.stack(hidden_states, dim=0)
# Average over batches
token_embeddings = token_embeddings.mean(1)
token_embeddings = token_embeddings.permute(1,0,2)
token_embeddings.size()
## -> torch.Size([64, 13, 768])
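
One thing worth checking in the aggregation above: `mean(1)` averages over the sentence dimension, so each of the 64 rows corresponds to a position in the padded sequence rather than to a token. The sketch below illustrates the concern in plain Python. The two tokenizations are written by hand as an assumption (they are not tokenizer output), and `tokens_per_position` is a hypothetical helper.

```python
from itertools import zip_longest

# Assumed tokenizations, hand-written for illustration (not tokenizer output).
tokenized = [
    ["[CLS]", "some", "sentences", ".", "[SEP]"],
    ["[CLS]", "some", "sentences", "are", ".", "[SEP]"],
]


def tokens_per_position(rows, pad="[PAD]"):
    """Return, for each position, the set of distinct tokens found there."""
    return [set(column) for column in zip_longest(*rows, fillvalue=pad)]


position_tokens = tokens_per_position(tokenized)

# Positions where the sentences disagree about which token is present.
mixed_positions = [i for i, toks in enumerate(position_tokens) if len(toks) > 1]
print(mixed_positions)  # [3, 4, 5]
```

If different sentences hold different tokens at the same position, a position-wise mean blends their embeddings, which may be why the result is hard to map back to individual tokens.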

Preparing the token embedding matrix

token_vecs_sum = []
# For each token in the sentence...
for token in token_embeddings:
  # `token` is a [13 x 768] tensor
  # Sum the vectors from the last four layers.
  sum_vec = torch.sum(token[-4:], dim=0)  
  # Use `sum_vec` to represent `token`.
  token_vecs_sum.append(sum_vec)

print ('Shape is: %d x %d' % (len(token_vecs_sum), len(token_vecs_sum[0])))
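
One possible way to get one vector per vocabulary token is to skip the batch average, keep one vector per (sentence, position) pair, and group those vectors by the token id found at that position. The following is a minimal sketch of that grouping in plain Python, not a tested answer for this exact pipeline: nested lists stand in for tensors, and the function name, the toy vocabulary, the ids and the vector values are all made up for illustration.

```python
from collections import defaultdict


def average_embeddings_by_token(input_ids, attention_masks, token_vectors, id_to_token):
    """Group per-position vectors by token and average each group.

    input_ids, attention_masks: [num_sentences][seq_len] nested lists of ints
    token_vectors: [num_sentences][seq_len][hidden] nested lists of floats
    id_to_token: dict mapping a vocabulary id to its token string
    """
    sums = {}
    counts = defaultdict(int)
    for ids, mask, vectors in zip(input_ids, attention_masks, token_vectors):
        for token_id, keep, vec in zip(ids, mask, vectors):
            if not keep:  # skip padding positions
                continue
            token = id_to_token[token_id]
            if token not in sums:
                sums[token] = list(vec)
            else:
                sums[token] = [a + b for a, b in zip(sums[token], vec)]
            counts[token] += 1
    # Divide each running sum by the number of occurrences of that token.
    return {tok: [x / counts[tok] for x in total] for tok, total in sums.items()}


# Toy data (hypothetical): 2 sentences, 3 positions each, hidden size 2.
toy_vocab = {0: "[PAD]", 1: "[CLS]", 2: "some"}
toy_ids = [[1, 2, 0], [1, 2, 2]]
toy_mask = [[1, 1, 0], [1, 1, 1]]
toy_vecs = [
    [[1.0, 1.0], [2.0, 4.0], [9.0, 9.0]],
    [[3.0, 3.0], [4.0, 6.0], [6.0, 8.0]],
]

averaged = average_embeddings_by_token(toy_ids, toy_mask, toy_vecs, toy_vocab)
print(averaged["some"])  # [4.0, 6.0]
```

With the variables from the question, the inputs could presumably come from `tokens_tensor.tolist()`, `segments_tensor.tolist()` (which holds the attention masks), `torch.stack(hidden_states[-4:]).sum(0).tolist()` for the per-position vectors, and `tokenizer.convert_ids_to_tokens` for the id-to-token lookup. BERT tokens are WordPiece units, so a word that is split into several pieces would get one averaged vector per piece.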

0 answers:

No answers