Question

I recently posted this question and have been trying to solve my problem. My questions are:

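Is my approach correct?
My example sentences have lengths 7 and 6 (['New Delhi is the capital of India', 'The capital of India is Delhi']), and even after I add the cls and sep tokens the lengths are only 9 and 8. The max_seq_len parameter is 10, so why do the last rows of {{1}} and x1 differ?
When my passage has more than 2 sentences, how do I embed it? Do I have to pass one sentence at a time? But in that case, won't I lose information, since I am not passing all the sentences together?
- I did some further research, and it seems I can pass the whole passage as a single sentence by setting x2 to 0 for all words in the passage. Is that correct?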
How do I get embeddings for ALBERT? I see that ALBERT also has a tokenization.py file, but I don't see vocab.txt. I do see the file 30k-clean.vocab. Can I use 30k-clean.vocab instead of vocab.txt?

Answer 1

@user2543622, you can refer to the official code here. In your case, you can do the following:

import tensorflow as tf  # TF 1.x API: tf.Session and hub.Module are used below
import tensorflow_hub as hub
albert_module = hub.Module("https://tfhub.dev/google/albert_base/2", trainable=True)
print(albert_module.get_signature_names()) # should output ['tokens', 'tokenization_info', 'mlm']
# then 
tokenization_info = albert_module(signature="tokenization_info",
                                  as_dict=True)
with tf.Session() as sess:
  vocab_file, do_lower_case = sess.run([tokenization_info["vocab_file"],
                                        tokenization_info["do_lower_case"]])
print(vocab_file) # output b'/var/folders/v6/vnz79w0d2dn95fj0mtnqs27m0000gn/T/tfhub_modules/098d91f064a4f53dffc7633d00c3d8e87f3a4716/assets/30k-clean.model'

My guess is that this vocab_file is a binary sentencepiece model file, so you should use this file for tokenization as follows, rather than 30k-clean.vocab.

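# you still need the tokenization.py code to perform full tokenization
import tokenization
# FLAGS.spm_model_file is the path to the sentencepiece model file
# (presumably the 30k-clean.model path printed above)
tokenizer = tokenization.FullTokenizer(
  vocab_file=vocab_file, do_lower_case=do_lower_case,
  spm_model_file=FLAGS.spm_model_file)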

If you only need the values of the embedding matrix, you can look at albert_module.variable_map, for example:

print(albert_module.variable_map['bert/embeddings/word_embeddings'])
# <tf.Variable 'module/bert/embeddings/word_embeddings:0' shape=(30000, 128) dtype=float32>

Answer 2

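Your approach seems correct.
Could you use the tokenizer to check how sentences 1 and 2 are tokenized? That can reveal whether one of the sentences contains additional tokens. You can check it as follows: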

import tokenization
tokenizer = tokenization.FullTokenizer(vocab_file=<PATH to Vocab file>, do_lower_case=True)
tokens = tokenizer.tokenize(example.text_a)
print(tokens)

This should give you the list of word tokens, without the [CLS] and [SEP] tokens.

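In general, word-piece tokenization splits a word whenever it is not in the vocabulary, so the tokenized sequence can be longer than the number of words in the input.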
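To make the splitting behaviour concrete, here is a minimal sketch of greedy longest-match word-piece splitting. It is a simplified illustration, not the tokenization.py implementation, and `toy_vocab` and `wordpiece_tokenize` are hypothetical names made up for this example. ALBERT itself uses a sentencepiece model, whose pieces look different, but the effect on sequence length is similar.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            return [unk_token]  # no piece matched: the whole word becomes unknown
        pieces.append(current)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the real BERT/ALBERT vocabulary.
toy_vocab = {"the", "capital", "of", "india", "is", "new", "del", "##hi"}

sentence = "new delhi is the capital of india"
tokens = []
for word in sentence.split():
    tokens.extend(wordpiece_tokenize(word, toy_vocab))

print(tokens)
print(len(sentence.split()), "words ->", len(tokens), "tokens")
```

With this toy vocabulary "delhi" is not a single entry, so it is split into two pieces and the 7-word sentence becomes 8 tokens before [CLS] and [SEP] are added. Printing the real tokenizer's output for both sentences shows whether something similar happens there.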

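You can pass both sentences together, as long as the length of the passage after word-piece tokenization does not exceed the max_sequence length.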
ALBERT's vocab file is in the … directory, provided that you got the ALBERT code from here. If you got the model from tf-hub, the file is …
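The point above about staying within the max sequence length can be sketched as follows. This is an illustration under stated assumptions rather than the code from the BERT or ALBERT repositories: `build_features` is a hypothetical helper, the passage is treated as a single segment (all segment ids 0), and padding is shown with a "[PAD]" string instead of vocabulary ids.

```python
def build_features(tokens, max_seq_len):
    """Wrap tokens in [CLS]/[SEP], truncate if needed, and pad to max_seq_len."""
    # Reserve two positions for the [CLS] and [SEP] markers.
    tokens = tokens[: max_seq_len - 2]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    input_mask = [1] * len(tokens)   # 1 marks a real token
    segment_ids = [0] * len(tokens)  # assumption: one segment, so every id is 0
    pad = max_seq_len - len(tokens)
    tokens = tokens + ["[PAD]"] * pad
    input_mask = input_mask + [0] * pad  # 0 marks padding
    segment_ids = segment_ids + [0] * pad
    return tokens, input_mask, segment_ids


# Word-level tokens for the two example sentences (assuming no word is split).
s1 = ["new", "delhi", "is", "the", "capital", "of", "india"]
s2 = ["the", "capital", "of", "india", "is", "delhi"]

t1, m1, seg1 = build_features(s1, max_seq_len=10)
t2, m2, seg2 = build_features(s2, max_seq_len=10)
print(t1, m1)
print(t2, m2)

# A passage longer than the limit is cut so the result still has 10 entries.
t3, m3, seg3 = build_features(s1 + s2, max_seq_len=10)
print(t3, m3)
```

Under these assumptions the two sentences occupy 9 and 8 positions and differ only in how much padding follows. If a tokenizer splits a word into several pieces, the real-token count grows and the passage may be truncated instead.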

Embedding with tensorflow_hub BERT on a Windows machine - extending to ALBERT
