Question

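Hi, I am trying to find similar sentences using doc2vec. What I can't figure out is how to get the actual sentence that matches one of the trained sentences.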

Below is the code from the link:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
data = ["I love machine learning. Its awesome.",
        "I love coding in python",
        "I love building chatbots",
        "they chat amagingly well"]

tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]
max_epochs = 100
vec_size = 20
alpha = 0.025

model = Doc2Vec(size=vec_size,
                alpha=alpha, 
                min_alpha=0.00025,
                min_count=1,
                dm =1)

model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

model.save("d2v.model")
print("Model Saved")

model= Doc2Vec.load("d2v.model")
#to find the vector of a document which is not in training data
test_data = word_tokenize("I love building chatbots".lower())
v1 = model.infer_vector(test_data)
print("V1_infer", v1)

# to find most similar doc using tags
similar_doc = model.docvecs.most_similar('1')
print(similar_doc)


# to find vector of doc in training data using tags or in other words, printing the vector of document at index 1 in training data
print(model.docvecs['1'])

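But the code above only gives me vectors or numbers. How do I get the actual matching sentence from the training data? For example, in this case I expect the result to be "I love building chatbots".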

Answer 1

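The output of similar_doc is: [('2', 0.991769552230835), ('0', 0.989276111125946), ('3', 0.9854298830032349)]

This shows the similarity score of each document in data against the requested document, sorted in descending order.

Based on this, the document at index '2' in data is the closest to the requested data, i.e. test_data.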

print(data[int(similar_doc[0][0])])
# prints: I love building chatbots

Note: this code gives different results on every run; you may need a better model or more training data.

Hope this helps. Good luck.

Answer 2

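Doc2Vec does not give good results on toy-sized datasets, so you shouldn't expect anything meaningful until you use much more data.

Also, a Doc2Vec model does not internally retain the full texts you supplied during training. It only remembers the learned vector for each text's tag, which is usually a unique identifier. So the results you get back from most_similar() are tag values, and you need to look those up yourself, in your own code/data, to retrieve the full documents.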
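The lookup described above can be sketched without gensim at all. This is only an illustration: sentences_by_tag and resolve_tags are hypothetical names, and similar_doc reuses the (tag, score) pairs quoted in Answer 1 rather than a fresh model run.

```python
data = ["I love machine learning. Its awesome.",
        "I love coding in python",
        "I love building chatbots",
        "they chat amagingly well"]

# Our own tag -> sentence mapping; the model itself never stores the text.
# The tags mirror the question's tags=[str(i)] convention.
sentences_by_tag = {str(i): sentence for i, sentence in enumerate(data)}

# Example result in the shape most_similar() returns: (tag, score) pairs.
# These values are copied from the output shown in Answer 1.
similar_doc = [('2', 0.991769552230835),
               ('0', 0.989276111125946),
               ('3', 0.9854298830032349)]

def resolve_tags(results, lookup):
    """Turn (tag, score) pairs into (sentence, score) pairs."""
    return [(lookup[tag], score) for tag, score in results]

matches = resolve_tags(similar_doc, sentences_by_tag)
for sentence, score in matches:
    print("%.4f  %s" % (score, sentence))
```

A dict keyed by tag keeps working if the tags later become arbitrary IDs instead of list indices, which is why it is used here rather than data[int(tag)].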

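Separately:

Calling train() multiple times in a loop, as you are doing, is a bad and error-prone idea, and so is managing alpha/min_alpha explicitly. You shouldn't follow any tutorial/guide that recommends that approach.

Don't change the defaults of the alpha parameters, and call train() once with your desired epochs count. It will then do the right number of passes and the right learning-rate management.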

Answer 3

To get an actual result, you have to pass the text as a vector to the most_similar method. Hardcoding most_similar('1') will always return a static result.

similar_doc = model.docvecs.most_similar([v1])

Modified code

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
data = ["I love machine learning. Its awesome.",
        "I love coding in python",
        "I love building chatbots",
        "they chat amagingly well"]

def output_sentences(most_similar):
    # print the most, second-most, median and least similar sentences with their scores
    for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(most_similar)//2), ('LEAST', len(most_similar) - 1)]:
        print(u'%s %s: %s\n' % (label, most_similar[index][1], data[int(most_similar[index][0])]))

tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]
max_epochs = 100
vec_size = 20
alpha = 0.025

model = Doc2Vec(size=vec_size,
                alpha=alpha, 
                min_alpha=0.00025,
                min_count=1,
                dm =1)

model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

model.save("d2v.model")
print("Model Saved")

model= Doc2Vec.load("d2v.model")
#to find the vector of a document which is not in training data
test_data = word_tokenize("I love building chatbots".lower())
v1 = model.infer_vector(test_data)
print("V1_infer", v1)

# to find the most similar docs using the inferred vector
similar_doc = model.docvecs.most_similar([v1])
print(similar_doc)

# to print similar sentences
output_sentences(similar_doc) 


# to find vector of doc in training data using tags or in other words, printing the vector of document at index 1 in training data
print(model.docvecs['1'])

Semantic “Similar Sentences” with your dataset-NLP

If you are looking for accurate predictions on your dataset and the dataset is small, you can try:

pip install similar-sentences

Doc2Vec find similar sentences
