Question

I'm new to NLP, but I'm trying to match a list of sentences against another list of sentences in Python based on their semantic similarity. For example,

list1 = ['what they ate for lunch', 'height in inches', 'subjectid']
list2 = ['food eaten two days ago', 'height in centimeters', 'id']

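Based on previous posts and prior knowledge, it seems the best approach is to create a document vector for each sentence and compute cosine similarity scores between the lists. The other posts I've found about Doc2Vec, as well as the tutorial, seem to focus on prediction. This post does a good job of doing the calculation by hand, but I thought Doc2Vec could already do that. The code I'm using is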

import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def build_model(train_docs, test_docs, comp_docs):
    '''
    Parameters
    -----------
    train_docs: list of lists - both known sentence lists combined
    test_docs: list of lists - one of the sentence lists
    comp_docs: list of lists - combined sentence lists, used to map an index back to its sentence
    '''
    # Train model with a single train() call; the original loop passed
    # epochs=epoch, which starts at 0 and varies the epoch count on every pass
    model = Doc2Vec(dm = 0, dbow_words = 1, window = 2, alpha = 0.2)
    model.build_vocab(train_docs)
    model.train(train_docs, total_examples = model.corpus_count, epochs = 10)


    scores = []

    for doc in test_docs:
        dd = {}
        # Calculate the cosine similarity and return top 40 matches
        score = model.docvecs.most_similar([model.infer_vector(doc)],topn=40)
        key = " ".join(doc)
        for i in range(len(score)):
            # Get index and score
            x, y = score[i]
            #print(x)
            # Match sentence from other list
            nkey = ' '.join(comp_docs[x])
            dd[nkey] = y
        scores.append({key: dd})

    return scores

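This works for computing similarity scores, but the problem is that I have to train the model on all the sentences from both lists (or from one list) and then match. Is there a way to use Doc2Vec to get the vectors and then compute the cosine similarity? To be clear, I'm trying to find the most similar sentences between the lists. I'd expect output like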

scores = []
for s1 in list1:
    for s2 in list2:
        scores.append((s1, s2, similarity(s1, s2)))

print(scores)
[('what they ate for lunch', 'food eaten two days ago', 0.23567),
 ('what they ate for lunch', 'height in centimeters', 0.120),
 ('what they ate for lunch', 'id', 0.01023),
 ('height in inches', 'food eaten two days ago', 0.123),
 ('height in inches', 'height in centimeters', 0.8456),
 ('height in inches', 'id', 0.145),
 ('subjectid', 'food eaten two days ago', 0.156),
 ('subjectid', 'height in centimeters', 0.1345),
 ('subjectid', 'id', 0.9567)]

Answer 1

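Doc2vec can generate a vector if you give it the words you want a vector for, but a doc2vec model still has to exist. That model does not necessarily need to be trained on the sentences you are comparing, though. I don't know whether pre-generated doc2vec models exist, but I do know you can import a word2vec model with pretrained vectors. Whether you want to do this depends on the kind of sentences you are comparing: word2vec models are generally trained on corpora such as Wikipedia or 20newsgroup, so they may have no vectors (or poor vectors) for words that rarely appear in those articles. If you are comparing sentences full of scientific terms, for instance, you may not want a pretrained model. However, you won't be able to generate a vector without training a model first (which I think is your core question).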
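One way to read the suggestion above is to average pretrained word vectors per sentence and compare the averages. The sketch below is an assumption about how that could look, and it is word-vector averaging rather than Doc2Vec itself. `word_vectors` is any mapping from word to vector (a gensim `KeyedVectors` object supports the same `in` and `[]` access), and the toy 2-dimensional vectors are invented purely for illustration.

```python
import math

def sentence_vector(sentence, word_vectors):
    """Average the vectors of the words that have one; None if no word is known."""
    known = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy vectors, invented for this example; real ones would come from a pretrained model.
toy_vectors = {
    "height": [1.0, 0.0],
    "inches": [0.9, 0.1],
    "centimeters": [0.8, 0.2],
    "lunch": [0.0, 1.0],
}

a = sentence_vector("height in inches", toy_vectors)
b = sentence_vector("height in centimeters", toy_vectors)
c = sentence_vector("what they ate for lunch", toy_vectors)
print(cosine(a, b), cosine(a, c))
```

Words missing from the vocabulary are simply skipped here, which is exactly the weakness described above: a sentence made up entirely of unknown terms (such as 'subjectid') gets no vector at all.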

Answer 2

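If your concern is that training the model and getting results at run time is time-consuming, consider saving the model. You can train the model in a separate file and save it to disk.

Right after training,

model.save("similar_sentence.model")

Create a new file and load the model as shown below,

model = Doc2Vec.load("similar_sentence.model")

The model file holds the vectors for the sentences you trained on.

The model object can be saved and loaded anywhere in your code.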
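With a trained model available, the pairwise scores the question asks for could be computed along these lines. This is only a sketch: `embed` is a hypothetical parameter standing for whatever turns a sentence into a vector, and `toy_embed` is a made-up stand-in (word counts over a tiny fixed vocabulary) so the block runs without a trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def pairwise_scores(list1, list2, embed):
    """Return (s1, s2, cosine) for every pair, embedding each sentence once."""
    vecs1 = [embed(s) for s in list1]
    vecs2 = [embed(s) for s in list2]
    return [(s1, s2, cosine(v1, v2))
            for s1, v1 in zip(list1, vecs1)
            for s2, v2 in zip(list2, vecs2)]

# Stand-in embedding, for illustration only: counts of a few fixed words.
VOCAB = ["height", "in", "id", "food"]

def toy_embed(sentence):
    words = sentence.lower().split()
    return [float(words.count(w)) for w in VOCAB]

list1 = ['what they ate for lunch', 'height in inches', 'subjectid']
list2 = ['food eaten two days ago', 'height in centimeters', 'id']

scores = pairwise_scores(list1, list2, toy_embed)
best = max(scores, key=lambda t: t[2])
print(best)
```

With a loaded gensim model, `embed` could be something like `lambda s: model.infer_vector(s.lower().split())`. Keep in mind that `infer_vector` is randomized, so repeated calls on the same sentence can give slightly different scores.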

Semantic “Similar Sentences” with your dataset-NLP
