Using pre-trained BERT word embeddings in TensorFlow as input to an LSTM

Time: 2021-02-11 10:59:38

Tags: python tensorflow keras lstm bert-language-model

I am trying to build a seemingly simple sentiment analysis model on the IMDB dataset (50,000 movie reviews labelled positive or negative). What I want to do in practice is transform each sample with pre-trained BERT word embeddings and then, once the data is in the right form, feed it into an LSTM that performs a binary classification at the end. The problem is that this is the first time I work with word embeddings, and everything I find on the internet says something different, so I am quite confused.
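To make the goal concrete, the model I have in mind looks roughly like this (only a sketch; the 64 units and the optimizer are placeholder choices I made up, not tuned values):

# Rough sketch of the intended classifier; hyperparameters are placeholders
model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(None, 768)),  # variable-length sequences of 768-dim BERT vectors
    tf.keras.layers.LSTM(64),                             # sequence -> single vector
    tf.keras.layers.Dense(1, activation='sigmoid'),       # binary sentiment
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])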

Here is what I tried:

import tensorflow as tf
import tensorflow_hub as tfhub
import tensorflow_datasets as tfds

# Input: load the dataset as (text, label) pairs and inspect a batch of two reviews
# (assuming the standard tensorflow_datasets loader)
imdb = tfds.load('imdb_reviews', split='train', as_supervised=True)
x, y = next(iter(imdb.batch(2)))
print(x)
print(y)


tf.Tensor(
[b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
 b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development was constant. Constantly slow and boring. Things seemed to happen, but with no explanation of what was causing them or why. I admit, I may have missed part of the film, but i watched the majority of it and everything just seemed to happen of its own accord without any real concern for anything else. I cant recommend this film at all.'], shape=(2,), dtype=string)
tf.Tensor([0 0], shape=(2,), dtype=int64)
# Preprocessor (the handle needs an explicit version, e.g. /3)
tokenizer = tfhub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
# BERT encoder
bert = tfhub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3")

def prep_embeddings(x):
  # Tokenize the raw strings and pack them into BERT's input format
  # (padded/truncated to the preprocessor's default of 128 tokens)
  x = tokenizer(x)
  # Contextual embedding for every token: shape (batch_size, 128, 768)
  x = bert(x)['sequence_output']
  return x
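Calling this on the two-review batch from above:

embeddings = prep_embeddings(x)
print(embeddings)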

What I get is this:

<tf.Tensor: shape=(2, 128, 768), dtype=float32, numpy=
array([[[ 0.31838903,  0.00969215,  0.12912029, ..., -0.22067544,
          0.6303109 ,  0.17825796],
        [-0.45852053, -0.5108558 , -0.37394696, ..., -0.6837409 ,
          1.2645687 ,  0.32450032],
        [-0.07489417, -0.7175749 , -0.17673326, ...,  0.17120832,
          0.24352893,  0.6327763 ],
        ...,
        [-0.12189512,  0.20674506,  0.07548869, ..., -0.05297606,
          0.55583966,  0.9672906 ],
        [-0.33615124,  0.33631802, -0.3819321 , ...,  0.25832555,
          0.40085706, -0.00651064],
        [ 0.73269963,  0.47423407,  0.41130567, ...,  0.83341473,
         -0.41439694, -0.35139242]]], dtype=float32)>

This looks more or less correct to me, but it has one big flaw: the sequence length is fixed. Since I want to use this as input to an LSTM, I would rather not have a fixed sequence length, yet with this implementation, looking at the documentation, it seems that a sequence length must be specified. Does anybody have any suggestions?
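For reference, the preprocessing model's documentation suggests the tokenize and pack steps can be assembled by hand so that seq_length becomes a parameter (an untested sketch on my side; 256 is an arbitrary value, and the length is still fixed):

# Sketch following the preprocessor docs: choose seq_length explicitly
preprocessor = tfhub.load("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
tokenize = tfhub.KerasLayer(preprocessor.tokenize)
pack = tfhub.KerasLayer(preprocessor.bert_pack_inputs,
                        arguments=dict(seq_length=256))

text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
encoder_inputs = pack([tokenize(text_input)])

Even then the output is padded to one length; presumably the padding could be ignored downstream using the input_mask the preprocessor returns, but I am not sure that is the intended way.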

0 Answers:

No answers