Question

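I have created a Python script to train a doc2vec model and infer vectors for test documents.

My problem is that when I try to find the most similar phrases for an example such as "the word", it only shows me the most similar words. It does not show a list of the most similar phrases.

Am I missing something in my code?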

#python example to infer document vectors from trained doc2vec model
import gensim.models as g
import codecs

#parameters
model="toy_data/model.bin"
test_docs="toy_data/test_docs.txt"
output_file="toy_data/test_vectors.txt"

#inference hyper-parameters
start_alpha=0.01
infer_epoch=1000

#load model
m = g.Doc2Vec.load(model)
test_docs = [ x.strip().split() for x in codecs.open(test_docs, "r", "utf-8").readlines() ]

#infer test vectors
output = open(output_file, "w")
for d in test_docs:
    output.write( " ".join([str(x) for x in m.infer_vector(d, alpha=start_alpha, steps=infer_epoch)]) + "\n" )
output.flush()
output.close()


m.most_similar('the word'.split())

I get this list:

[('refutations', 0.9990279078483582),
 ('volume', 0.9989271759986877),
 ('italic', 0.9988381266593933),
 ('syllogisms', 0.998751699924469),
 ('power', 0.9987285137176514),
 ('alibamu', 0.9985184669494629),
 ("''", 0.99847412109375),
 ('roman', 0.9984466433525085),
 ('soil', 0.9984269738197327),
 ('plants', 0.9984176754951477)]

Answer 1

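A Doc2Vec model collects its document vectors, for later lookup or search, in the property .docvecs. To get document-vector results, perform most_similar() on that property. If your Doc2Vec instance is held in the variable d2v_model, and doc_id holds one of the document tags known from training, that might be: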

d2v_model.docvecs.most_similar(doc_id)

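If you are inferring a vector for a new document and want to look up the training documents most similar to that inferred vector, your code might resemble: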

new_dv = d2v_model.infer_vector('some new document'.split())
d2v_model.docvecs.most_similar(positive=[new_dv])
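
To make the idea of a vector-based lookup concrete, here is a minimal pure-Python sketch of ranking stored document vectors by cosine similarity to a query vector. It is not gensim's implementation; the helper names (cosine, rank_by_cosine) and the toy three-dimensional vectors are hypothetical and only illustrate the principle.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def rank_by_cosine(query, doc_vectors, topn=10):
    """Return up to topn (tag, similarity) pairs, most similar first."""
    scored = [(tag, cosine(query, vec)) for tag, vec in doc_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hypothetical toy data standing in for trained document vectors.
doc_vectors = {
    "doc_0": [1.0, 0.0, 0.0],
    "doc_1": [0.5, 0.5, 0.0],
    "doc_2": [0.0, 1.0, 0.0],
}
# Stand-in for a vector inferred for a new document.
query = [1.0, 0.1, 0.0]

ranked = rank_by_cosine(query, doc_vectors, topn=2)
print(ranked)
```

The result has the same shape as the list in the question, (key, similarity) pairs, except that the keys are document tags rather than words.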

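(The Doc2Vec model class is derived from the very similar Word2Vec class, and thus inherits a most_similar() that by default consults only the internal word vectors. Those word vectors may be useful in some Doc2Vec modes, or random in others, but it is best to call either d2v_model.wv.most_similar() or d2v_model.docvecs.most_similar() explicitly to be clear.)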
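
One caveat when picking the attribute to query: the name depends on the installed gensim version. The sketch below assumes that gensim 4.x exposes the document vectors as .dv, while older releases use .docvecs as in this answer. The helper get_doc_vectors is hypothetical, and the SimpleNamespace objects are stand-ins so the snippet runs without a trained model.

```python
from types import SimpleNamespace

def get_doc_vectors(model):
    """Return the document-vector store of a Doc2Vec-like object."""
    # Assumption: newer gensim releases name the store `dv`,
    # older ones name it `docvecs`.
    store = getattr(model, "dv", None)
    if store is None:
        store = getattr(model, "docvecs", None)
    if store is None:
        raise AttributeError("model exposes neither .dv nor .docvecs")
    return store

# Stand-in objects in place of real trained models.
old_style = SimpleNamespace(docvecs="doc-vector store (old attribute)")
new_style = SimpleNamespace(dv="doc-vector store (new attribute)")

print(get_doc_vectors(old_style))
print(get_doc_vectors(new_style))
```

With a real model, the returned object is what you would call most_similar() on to get document tags instead of words.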

The introductory gensim material, such as the notebook doc2vec-lee.ipynb in its docs/notebooks directory, contains useful examples.

Determining the most similar phrases with word2vec
