I recently posted this question and have been trying to solve my problem. My problem is this: for the input
(['New Delhi is the capital of India', 'The capital of India is Delhi'])
even after I add the cls and sep tokens, the lengths are 9 and 8 respectively. The max_seq_len parameter is 10, so why are the last rows of x1 and x2 different? segment_ids is set to 0, passing the entire paragraph as a single sentence. Is that correct? Also, for the tokenization.py file I cannot find vocab.txt; I only see the file 30k-clean.vocab. Can I use 30k-clean.vocab instead of vocab.txt?
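Roughly, the padding I am describing looks like this (a simplified, hypothetical sketch, not the actual tokenizer output):
max_seq_len = 10

# simplified stand-in token lists: 9 and 8 tokens after adding the cls/sep tokens
x1_tokens = ['[CLS]', 'new', 'delhi', 'is', 'the', 'capital', 'of', 'india', '[SEP]']
x2_tokens = ['[CLS]', 'the', 'capital', 'of', 'india', 'is', 'delhi', '[SEP]']

def pad_to_max(tokens, max_len, pad_token='<pad>'):
    # every sequence is padded (or truncated) to the same max_seq_len
    return tokens[:max_len] + [pad_token] * max(0, max_len - len(tokens))

print(pad_to_max(x1_tokens, max_seq_len))  # ends with 1 pad token
print(pad_to_max(x2_tokens, max_seq_len))  # ends with 2 pad tokens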
Answer 0: (Score: 2)
@user2543622, you can refer to the official code here; in this case, you can do the following:
import tensorflow as tf  # TF1.x-style API (hub.Module and tf.Session)
import tensorflow_hub as hub

albert_module = hub.Module("https://tfhub.dev/google/albert_base/2", trainable=True)
print(albert_module.get_signature_names())  # should output ['tokens', 'tokenization_info', 'mlm']

# then look up where the module keeps its tokenization assets
tokenization_info = albert_module(signature="tokenization_info", as_dict=True)
with tf.Session() as sess:
    vocab_file, do_lower_case = sess.run([tokenization_info["vocab_file"],
                                          tokenization_info["do_lower_case"]])
print(vocab_file)
# output: b'/var/folders/v6/vnz79w0d2dn95fj0mtnqs27m0000gn/T/tfhub_modules/098d91f064a4f53dffc7633d00c3d8e87f3a4716/assets/30k-clean.model'
I would guess that this vocab_file is a binary sentencepiece model file, so you should tokenize with this file as follows, rather than using 30k-clean.vocab:
# you still need the tokenization.py code to perform full tokenization
return tokenization.FullTokenizer(
    vocab_file=vocab_file, do_lower_case=do_lower_case,
    spm_model_file=FLAGS.spm_model_file)
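For completeness, a rough sketch of how that tokenizer could be built directly, assuming tokenization.py from the google-research/albert repository is importable and that the placeholder path below is replaced with the 30k-clean.model path printed above:
import tokenization  # tokenization.py from the google-research/albert repository

# placeholder path: point this at the 30k-clean.model asset printed by the hub module above
spm_model_file = "/path/to/tfhub_modules/.../assets/30k-clean.model"
tokenizer = tokenization.FullTokenizer(
    vocab_file=None, do_lower_case=True, spm_model_file=spm_model_file)
print(tokenizer.tokenize("New Delhi is the capital of India"))  # sentencepiece word pieces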
If you only need the embedding matrix values, you can look at albert_module.variable_map, for example:
print(albert_module.variable_map['bert/embeddings/word_embeddings'])
# <tf.Variable 'module/bert/embeddings/word_embeddings:0' shape=(30000, 128) dtype=float32>
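If you need the actual values rather than the tf.Variable, a rough TF1-style sketch (reusing the albert_module created above) would be:
# materialize the pretrained embedding matrix as a numpy array
embeddings_var = albert_module.variable_map['bert/embeddings/word_embeddings']
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # loads the module's pretrained weights
    embedding_matrix = sess.run(embeddings_var)
print(embedding_matrix.shape)  # (30000, 128)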
Answer 1: (Score: 1)
import tokenization
tokenizer = tokenization.FullTokenizer(vocab_file=<PATH to Vocab file>, do_lower_case=True)
tokens = tokenizer.tokenize(example.text_a)
print(tokens)
This should give you a list of word-piece tokens, without the CLS and SEP tokens.
In general, word-piece tokenization splits a word when it is not in the vocabulary, so the resulting token length is higher than the number of input tokens.
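As a hypothetical illustration (the exact pieces depend on the vocabulary used):
# a rare word is split into several word pieces, so the token count
# ends up larger than the number of input words
tokens = tokenizer.tokenize("Tokenization of unfamiliar words")
# might come back as something like ['token', '##ization', 'of', 'un', '##fam', '##iliar', 'words']
# i.e. 4 input words -> 7 tokens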