Using spaCy to find the most similar sentences in a document

Asked: 2019-05-15 13:33:39

Tags: gensim similarity spacy doc2vec sentence-similarity

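I'm looking for something like Gensim's most_similar(), but using spaCy. I want to use NLP to find the most similar sentence in a list of sentences.

I tried calling spaCy's similarity() (see https://spacy.io/api/doc#similarity) on the sentences one by one in a loop, but that takes a very long time.

Going further:

I would like to put all these sentences in a graph (like this) to find clusters of sentences.

Any ideas?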

1 Answer:

Answer 0 (score: 1)

Here is a simple built-in solution you could use:

import spacy

nlp = spacy.load("en_core_web_lg")
text = (
    "Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity."
    " These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature."
    " The term semantic similarity is often confused with semantic relatedness."
    " Semantic relatedness includes any relation between two terms, while semantic similarity only includes 'is a' relations."
    " My favorite fruit is apples."
)
doc = nlp(text)
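# Track the best score seen so far and the pair of sentences that produced it
max_similarity = 0.0
most_similar = None, None
for i, sent in enumerate(doc.sents):
    for j, other in enumerate(doc.sents):
        # Skip self-comparisons and pairs already compared in the other order
        if j <= i:
            continue
        similarity = sent.similarity(other)
        if similarity > max_similarity:
            max_similarity = similarity
            most_similar = sent, other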
print("Most similar sentences are:")
print(f"-> '{most_similar[0]}'")
print("and")
print(f"-> '{most_similar[1]}'")
print(f"with a similarity of {max_similarity}")

(Text from Wikipedia)

It will produce the following output:

Most similar sentences are:
-> 'Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity.'
and
-> 'These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature.'
with a similarity of 0.9583859443664551
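
The nested loop above calls similarity() once per pair, which is the slow part the question mentions. When a pipeline with word vectors is loaded, similarity() defaults to the cosine similarity of the averaged vectors, so one possible speed-up is to collect the sentence vectors once and compute all scores with a single matrix product. The following is a minimal sketch of that idea, not part of spaCy: the helper names are made up for illustration, and the toy vectors stand in for what would be `[sent.vector for sent in doc.sents]` in practice. It needs at least two sentences.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Return the matrix of pairwise cosine similarities between rows."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # all-zero rows get similarity 0 instead of NaN
    unit = vectors / norms
    return unit @ unit.T


def most_similar_pair(sim):
    """Return (i, j, score) for the highest-scoring pair with i < j."""
    upper = np.triu_indices(sim.shape[0], k=1)  # skip diagonal and mirrored pairs
    best = np.argmax(sim[upper])
    return int(upper[0][best]), int(upper[1][best]), float(sim[upper][best])


def most_similar(sim, index, topn=3):
    """Return up to topn (other_index, score) tuples for one row, best first."""
    order = np.argsort(-sim[index])
    return [(int(j), float(sim[index, j])) for j in order if j != index][:topn]


# Toy vectors (assumption): stand-ins for [sent.vector for sent in doc.sents]
vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
sim = cosine_similarity_matrix(vectors)
i, j, score = most_similar_pair(sim)
print(f"best pair: {i} and {j} with similarity {score:.4f}")
print("closest to sentence 0:", most_similar(sim, 0, topn=2))
```

The hypothetical most_similar() helper here plays roughly the role of Gensim's method of the same name: given the index of one sentence, it returns the closest other sentences by score.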

Note the following information from spacy.io:

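"To make them compact and fast, spaCy's small pipeline packages (all packages that end in sm) don't ship with word vectors, and only include context-sensitive tensors. This means you can still use the similarity() methods to compare documents, spans and tokens, but the result won't be as good, and individual tokens won't have any vectors assigned. So in order to use real word vectors, you need to download a larger pipeline package:"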

- python -m spacy download en_core_web_sm
+ python -m spacy download en_core_web_lg

See also "Document similarity in Spacy vs Word2Vec" for suggestions on how to improve the similarity scores.
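
As for the graph idea in the question, one simple way to get sentence clusters is to treat each sentence as a node, join two nodes whenever their similarity reaches a threshold, and read off the connected components. The sketch below uses only the standard library. The function name, the 0.8 threshold and the score matrix are all hypothetical; real scores would come from similarity(), and the threshold would need tuning for your data.

```python
from collections import deque


def similarity_clusters(sim, threshold=0.8):
    """Group indices into connected components of the thresholded similarity graph."""
    n = len(sim)
    # An edge joins i and j when their score reaches the threshold
    neighbours = {
        i: [j for j in range(n) if j != i and sim[i][j] >= threshold]
        for i in range(n)
    }
    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`
        queue = deque([start])
        seen.add(start)
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for other in neighbours[node]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        clusters.append(sorted(component))
    return clusters


# Hypothetical scores for five sentences; real values would come from similarity()
scores = [
    [1.00, 0.95, 0.90, 0.30, 0.20],
    [0.95, 1.00, 0.85, 0.25, 0.10],
    [0.90, 0.85, 1.00, 0.40, 0.30],
    [0.30, 0.25, 0.40, 1.00, 0.82],
    [0.20, 0.10, 0.30, 0.82, 1.00],
]
print(similarity_clusters(scores, threshold=0.8))
```

Connected components are a deliberately crude notion of a cluster: a single strong link is enough to merge two groups, so a proper clustering algorithm may work better on larger or noisier data.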