Question

I am trying to use the Fast Word Mover's Distance library with SpaCy, following the same example as in its README.

import spacy
import wmd
nlp = spacy.load('en_core_web_md')
nlp.add_pipe(wmd.WMD.SpacySimilarityHook(nlp), last=True)

doc1 = nlp("Politician speaks to the media in Illinois.")
doc2 = nlp("The president greets the press in Chicago.")
print(doc1.similarity(doc2))

The result is:

6.070106029510498

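I don't know how to interpret it, because distances are usually normalized (0 to 1). The README does not show this result, so I am not sure whether my result is wrong or whether this metric simply uses a different scale.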

Answer 1

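The short answer: don't interpret it. Just use it like this: the smaller the distance, the more similar the sentences. For nearly all practical applications (e.g. KNN), that is enough.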
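
To illustrate the short answer, here is a minimal sketch of a nearest-neighbour lookup that relies only on the ordering of distances. The distance values and candidate labels are made-up placeholders, not real WMD output.

```python
# Hypothetical raw (unnormalized) distances from one query text to three candidates.
raw_distances = {
    "candidate A": 6.07,
    "candidate B": 2.31,
    "candidate C": 9.84,
}


def nearest(distances, k=1):
    """Return the k labels with the smallest distance to the query."""
    return sorted(distances, key=distances.get)[:k]


# Dividing every distance by the same positive constant keeps the order,
# so the neighbours are identical with or without a global rescaling.
rescaled = {label: d / 100.0 for label, d in raw_distances.items()}

print(nearest(raw_distances, k=2))
print(nearest(rescaled, k=2))
```

Both calls return the same neighbours, which is why the absolute scale of the distance rarely matters for KNN-style use.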

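Now the long answer: Word Mover's Distance (read the paper) is defined as the weighted average of the distances between the best-matching pairs of non-stop words. So, if you want to normalize it into (0, 1), you need to divide this best sum by its worst case.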
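
As a rough illustration of that definition, the sketch below computes the distance for a simplified case: two texts with the same number of words, each word carrying equal weight. Under that assumption an optimal matching is one-to-one, so brute force over permutations is enough. The function name `toy_wmd` and the 2-D vectors are hypothetical; the real library works with frequency-based weights and a proper transport solver.

```python
import itertools
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def toy_wmd(vectors1, vectors2):
    """Average distance over the best one-to-one word matching.

    Assumes both texts have n words with weight 1/n each, so trying
    every permutation finds the optimal matching.
    """
    n = len(vectors1)
    if n != len(vectors2):
        raise ValueError("this sketch only handles equal-length texts")
    best_total = min(
        sum(euclidean(vectors1[i], vectors2[perm[i]]) for i in range(n))
        for perm in itertools.permutations(range(n))
    )
    return best_total / n


# Made-up 2-D "embeddings" for two three-word texts.
text1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
text2 = [(0.0, 1.0), (0.0, 0.0), (4.0, 0.0)]

print(toy_wmd(text1, text2))
print(toy_wmd(text1, text1))
```

Here two of the words match exactly and the third pair is 3 apart, so the average is 1.0; a text compared with itself gives 0.0.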

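The problem is that the word vectors in spaCy are not normalized (check this by printing [sum(t.vector**2) for t in doc1]). Therefore, the maximum distance between them is unbounded. And if you do normalize them, the new WMD will not be equivalent to the original WMD (i.e. it will rank pairs of texts differently). So there is no obvious way to normalize the original spaCy-WMD distances that you demonstrated.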
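
The claim that normalization changes the ranking can be seen with a small sketch. The three 2-D vectors below are invented for illustration and stand in for word vectors of very different lengths.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def unit(v):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)


query = (1.0, 0.0)
candidates = {
    "same_direction_long": (10.0, 0.0),   # same direction, much larger norm
    "other_direction_short": (0.0, 1.0),  # orthogonal, same norm as the query
}

# Closest candidate using the raw vectors.
raw_closest = min(candidates, key=lambda name: euclidean(query, candidates[name]))

# Closest candidate after unit-normalizing every vector.
unit_closest = min(
    candidates, key=lambda name: euclidean(unit(query), unit(candidates[name]))
)

print(raw_closest)
print(unit_closest)
```

With raw vectors the orthogonal candidate is nearer; after normalization the candidate pointing in the same direction wins, so the two distances order texts differently.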

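Now let's assume that the word vectors are unit-normalized. In that case, the maximum distance between two words is the diameter of the unit sphere (i.e. 2). And the maximum weighted average of many 2's is still 2. So you need to divide the distance between texts by 2 to make it fully normalized.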
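
The bound of 2 can be checked numerically. This is only a sketch with made-up vectors and weights: two unit vectors pointing in opposite directions are exactly one diameter apart, and a weighted average whose weights sum to 1 cannot exceed its largest term.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def unit(v):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)


# Two unit vectors pointing in exactly opposite directions.
a = unit((3.0, 4.0))
b = unit((-3.0, -4.0))
max_word_distance = euclidean(a, b)

# Hypothetical worst case: every matched word pair is a full diameter apart.
weights = [0.5, 0.3, 0.2]  # illustrative word weights summing to 1
pair_distances = [max_word_distance] * len(weights)
worst_case = sum(w * d for w, d in zip(weights, pair_distances))

print(max_word_distance)
print(worst_case / 2)  # dividing by 2 maps the worst case to 1
```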

You can build word vector normalization into the WMD calculation by inheriting from the class you use:

import wmd
import numpy
import libwmdrelax

class NormalizedWMDHook(wmd.WMD.SpacySimilarityHook):
    def compute_similarity(self, doc1, doc2):
        """
        Calculates the similarity between two spaCy documents. Extracts the
        nBOW from them and evaluates the WMD.

        :return: The calculated similarity.
        :rtype: float.
        """
        doc1 = self._convert_document(doc1)
        doc2 = self._convert_document(doc2)
        vocabulary = {
            w: i for i, w in enumerate(sorted(set(doc1).union(doc2)))}
        w1 = self._generate_weights(doc1, vocabulary)
        w2 = self._generate_weights(doc2, vocabulary)
        evec = numpy.zeros((len(vocabulary), self.nlp.vocab.vectors_length),
                           dtype=numpy.float32)
        for w, i in vocabulary.items():
            v = self.nlp.vocab[w].vector                                      # MODIFIED
            evec[i] = v / (sum(v**2)**0.5)                                    # MODIFIED
        evec_sqr = (evec * evec).sum(axis=1)
        dists = evec_sqr - 2 * evec.dot(evec.T) + evec_sqr[:, numpy.newaxis]
        dists[dists < 0] = 0
        dists = numpy.sqrt(dists)
        return libwmdrelax.emd(w1, w2, dists) / 2                             # MODIFIED
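
For readers wondering about the distance matrix in the method above: as far as I can tell, it relies on the identity ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2, and the clipping to zero guards against tiny negative values caused by rounding. The pure-Python sketch below checks the identity on two made-up vectors; the helper names are hypothetical and not part of the library.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def squared_distance_direct(a, b):
    """Squared Euclidean distance computed from the coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def squared_distance_expanded(a, b):
    """The same quantity via ||a||^2 - 2 a.b + ||b||^2."""
    return dot(a, a) - 2 * dot(a, b) + dot(b, b)


a = (0.6, 0.8)
b = (0.8, -0.6)

direct = squared_distance_direct(a, b)
# Clip at zero before the square root, mirroring dists[dists < 0] = 0.
expanded = max(squared_distance_expanded(a, b), 0.0)

print(math.sqrt(direct), math.sqrt(expanded))
```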

Now you can verify that the distance is properly normalized:

import spacy
nlp = spacy.load('en_core_web_md')
nlp.add_pipe(NormalizedWMDHook(nlp), last=True)
doc1 = nlp("Politician speaks to the media in Illinois.")
doc2 = nlp("The president greets the press in Chicago.")
print(doc1.similarity(doc2))
print(doc1.similarity(doc1))
print(doc1.similarity(nlp("President speaks to the media in Illinois.")))
print(doc1.similarity(nlp("some irrelevant bullshit")))
print(doc1.similarity(nlp("JDL")))

The results are now:

0.469503253698349
0.0
0.12690649926662445
0.6037049889564514
0.7507566213607788

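P.S. You can see that even between two very unrelated texts, this normalized distance is well below 1. That is because, in practice, word vectors do not cover the whole unit sphere; instead, most of them cluster in a few regions of it. Therefore, the distance between even very different texts will typically be less than 1.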
