I have successfully downloaded the 1B-word language model trained with a CNN-LSTM (https://github.com/tensorflow/models/tree/master/research/lm_1b), and I would like to be able to feed in sentences or partial sentences and get the probability of each subsequent word.
For example, given the partial sentence "An animal that says", I want to know the probability that the next word is "woof" versus "meow".
I know that running the following command produces the LSTM embeddings:
bazel-bin/lm_1b/lm_1b_eval --mode dump_lstm_emb \
--pbtxt data/graph-2016-09-10.pbtxt \
--vocab_file data/vocab-2016-09-10.txt \
--ckpt 'data/ckpt-*' \
--sentence "An animal that says woof" \
--save_dir output
This produces the files lstm_emb_step_*.npy, where each file contains the LSTM embedding for one word of the sentence. How can I turn these into probabilities under the trained model, so that I can compare P(woof | An animal that says) with P(meow | An animal that says)?
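For reference, the dumped files can be loaded with numpy to see what the command actually produced; a minimal sketch, assuming the output directory from the command above:

import glob
import numpy as np

# Each lstm_emb_step_*.npy holds the LSTM output for one word of the sentence.
for path in sorted(glob.glob("output/lstm_emb_step_*.npy")):
  emb = np.load(path)
  print(path, emb.shape)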
Thanks in advance.
Answer 0 (score: 0)
I wanted to do the same thing, and this is what I adapted from some of their demo code. I'm not entirely sure it is correct, but it seems to produce reasonable values.
import numpy as np

# Same values as in the repo's lm_1b_eval.py.
BATCH_SIZE = 1
NUM_TIMESTEPS = 1
MAX_WORD_LEN = 50

def get_probability_of_next_word(sess, t, vocab, prefix_words, query):
  """
  Return the probability of the given word based on the sequence of prefix
  words.

  :param sess: TensorFlow session object
  :param t: Dict of graph tensors returned by LoadModel
  :param vocab: Vocabulary model, maps id <-> string, stores max word char id length
  :param list prefix_words: List of words that appear before this one.
  :param str query: The query word
  """
  targets = np.zeros([BATCH_SIZE, NUM_TIMESTEPS], np.int32)
  weights = np.ones([BATCH_SIZE, NUM_TIMESTEPS], np.float32)

  # Make sure the prefix starts with the sentence-start token.
  if not prefix_words or prefix_words[0] != "<S>":
    prefix_words.insert(0, "<S>")

  prefix = [vocab.word_to_id(w) for w in prefix_words]
  prefix_char_ids = [vocab.word_to_char_ids(w) for w in prefix_words]

  inputs = np.zeros([BATCH_SIZE, NUM_TIMESTEPS], np.int32)
  char_ids_inputs = np.zeros(
      [BATCH_SIZE, NUM_TIMESTEPS, vocab.max_word_length], np.int32)

  # Feed the prefix one word at a time; the LSTM state is kept inside the
  # graph between sess.run calls, so the final softmax is conditioned on
  # the whole prefix.
  softmax = None
  for i in range(len(prefix)):
    inputs[0, 0] = prefix[i]
    char_ids_inputs[0, 0, :] = prefix_char_ids[i]
    softmax = sess.run(t['softmax_out'],
                       feed_dict={t['char_inputs_in']: char_ids_inputs,
                                  t['inputs_in']: inputs,
                                  t['targets_in']: targets,
                                  t['target_weights_in']: weights})

  # Probability that the next word is `query`, given the prefix.
  return softmax[0, vocab.word_to_id(query)]
Example usage:
vocab = CharsVocabulary(vocab_path, MAX_WORD_LEN)
sess, t = LoadModel(model_path, ckptdir + "/ckpt-*")
result = get_probability_of_next_word(sess, t, vocab, ["Hello", "my", "friend"], "for")
gives a result of 8.811023e-05. Note that CharsVocabulary and LoadModel are both slightly different from the versions in the repository.
Also note that this function is very slow. Perhaps someone knows how to improve it.
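With that function, the comparison from the question can be sketched roughly as follows. The prefix list is copied because the function mutates its prefix_words argument, and resetting the recurrent state between calls via the states_init op from the repo's demo code is an assumption here:

prefix = ["An", "animal", "that", "says"]

# Assumed: the graph keeps LSTM state between sess.run calls, so reset it
# before each query using the 'states_init' op from the repo's demo code.
sess.run(t['states_init'])
p_woof = get_probability_of_next_word(sess, t, vocab, list(prefix), "woof")

sess.run(t['states_init'])
p_meow = get_probability_of_next_word(sess, t, vocab, list(prefix), "meow")

print("P(woof | An animal that says) = %g" % p_woof)
print("P(meow | An animal that says) = %g" % p_meow)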