第1步：堆叠嵌入以获得单个`np.array`

Question

我只看到几个问这个问题的问题，而且他们都没有答案，所以我想我也可以尝试一下。我一直在使用gensim的word2vec模型来创建一些向量。我将它们导出到文本中，并尝试将其导入到tensorflow的嵌入式投影仪的实时模型中。一个问题。 它没有工作。它告诉我，张量的格式不正确。所以，作为初学者，我想我会问一些有更多可能解决方案经验的人相当于我的代码：

import gensim
corpus = [["words","in","sentence","one"],["words","in","sentence","two"]]
model = gensim.models.Word2Vec(iter = 5,size = 64)
model.build_vocab(corpus)
# save memory
vectors = model.wv
del model
vectors.save_word2vec_format("vect.txt",binary = False)

创建模型，保存向量，然后在带有所有维度值的制表符分隔文件中将结果打印出来。我理解如何做我正在做的事情，我只是无法弄清楚我把它放在张量流中的方式有什么问题，因为关于这个问题的文档非常缺乏我的能力。告诉。
提交给我的一个想法是实现适当的tensorflow代码，但我不知道如何编写代码，只是导入实时演示中的文件。

编辑：我现在有一个新问题。我有载体的对象是不可迭代的，因为gensim显然决定使自己的数据结构与我试图做的不兼容。
好。做完了！谢谢你的帮助！

Answer 1

您所描述的是可能的。您必须记住的是Tensorboard从保存的tensorflow二进制文件中读取，这些二进制文件代表磁盘上的变量。

有关保存和恢复张量流图和变量here
的更多信息

因此，主要任务是将嵌入设置为已保存的变量。

假设：

  以下代码中的

embeddings是一个python dict {word:np.array (np.shape==[embedding_size])}



python版本是3.5 +



使用的库包括numpy as np，tensorflow as tf



存储tf变量的目录是model_dir/

第1步：堆叠嵌入以获得单个`np.array`

embeddings_vectors = np.stack(list(embeddings.values(), axis=0))
# shape [n_words, embedding_size]

步骤2：将`tf.Variable`保存在磁盘

上

# Create some variables.
emb = tf.Variable(embeddings_vectors, name='word_embeddings')

# Add an op to initialize the variable.
init_op = tf.global_variables_initializer()

# Add ops to save and restore all the variables.
saver = tf.train.Saver()

# Later, launch the model, initialize the variables and save the
# variables to disk.
with tf.Session() as sess:
   sess.run(init_op)

# Save the variables to disk.
   save_path = saver.save(sess, "model_dir/model.ckpt")
   print("Model saved in path: %s" % save_path)

model_dir应包含checkpoint，model.ckpt-1.data-00000-of-00001，model.ckpt-1.index，model.ckpt-1.meta
文件

第3步：生成`metadata.tsv`

要拥有漂亮的标记嵌入云，您可以提供带有元数据的张量板作为制表符分隔值（tsv）（ cf。 here）。

words = '\n'.join(list(embeddings.keys()))

with open(os.path.join('model_dir', 'metadata.tsv'), 'w') as f:
   f.write(words)

# .tsv file written in model_dir/metadata.tsv

第4步：可视化

运行$ tensorboard --logdir model_dir - ＆gt; <强>投影

要加载元数据，神奇的地方就在这里：

提醒一下，http://projector.tensorflow.org/上还提供了一些 word2vec 嵌入投影

Answer 2

Gensim实际上具有执行此操作的官方方法。

Documentation about it

Answer 3

以上答案对我不起作用。我发现非常有用的是该脚本（以后将添加到gensim中）Source

要将数据转换为元数据：

model = gensim.models.Word2Vec.load_word2vec_format(model_path, binary=True)
with open( tensorsfp, 'w+') as tensors:
    with open( metadatafp, 'w+') as metadata:
         for word in model.index2word:
           encoded=word.encode('utf-8')
           metadata.write(encoded + '\n')
           vector_row = '\t'.join(map(str, model[word]))
           tensors.write(vector_row + '\n')

或遵循此gist

在Tensorboard投影仪中可视化Gensim Word2vec嵌入

3 个答案:

第1步：堆叠嵌入以获得单个`np.array`

步骤2：将`tf.Variable`保存在磁盘

第3步：生成`metadata.tsv`

第4步：可视化

在Tensorboard投影仪中可视化Gensim Word2vec嵌入

3 个答案:

第1步：堆叠嵌入以获得单个np.array

步骤2：将tf.Variable保存在磁盘

第3步：生成metadata.tsv

第4步：可视化

第1步：堆叠嵌入以获得单个`np.array`

步骤2：将`tf.Variable`保存在磁盘

第3步：生成`metadata.tsv`