Question

我正在处理一个文档搜索问题，在这个问题中，给定一组文档和一个搜索查询，我想找到最接近查询的文档。我使用的模型基于scikit中的TfidfVectorizer。我通过使用4种不同类型的标记化器为所有文档创建了4个不同的tf_idf向量。每个标记器将字符串拆分为n-gram，其中n在1 ... 4范围内。

例如：

doc_1 = "Singularity is still a confusing phenomenon in physics"
doc_2 = "Quantum theory still wins over String theory"

因此model_1将使用1-gram标记器，model_2将使用2-gram标记器。

接下来针对给定的搜索查询，我使用这4个模型计算搜索词与所有其他文档之间的余弦相似度。

例如，搜索查询：量子物理学中的奇点。搜索查询分解为n-gram，tf_idf值从相应的n-gram模型计算。

因此，对于每个查询 - 文档对，我基于所使用的n-gram模型具有4个相似度值。例如：

1-gram similarity = 0.4370303325246957
2-gram similarity = 0.36617374546988996
3-gram similarity = 0.29519246156322099
4-gram similarity = 0.2902998188509896

所有这些相似性得分在0到1的范围内归一化。现在我想计算聚合的归一化得分，使得对于任何查询 - 文档对，较高的n-gram相似性得到非常高的权重。基本上，ngram相似度越高，它对整体得分的影响越大。

有人可以建议一个解决方案吗？

Answer 1

有很多方法可以解决这些问题：

>>> onegram_sim = 0.43
>>> twogram_sim = 0.36
>>> threegram_sim = 0.29
>>> fourgram_sim = 0.29
# Sum(x) / len(list)
>>> all_sim = sum([onegram_sim, twogram_sim, threegram_sim, fourgram_sim]) / 4
>>> all_sim
0.3425
# Sum(x*x) / len(list)
>>> all_sim = sum(map(lambda x: x**2, [onegram_sim, twogram_sim, threegram_sim, fourgram_sim])) / 4
>>> all_sim
0.120675
# Product(x)
>>> from operator import mul
>>> onetofour_sim = [onegram_sim, twogram_sim, threegram_sim, fourgram_sim]
>>> reduce(mul, onetofour_sim, 1)
0.013018679999999998

最终，无论是什么让你获得更好的准确度得分都是最佳的解决方案。

超越你的问题：

要计算文档相似度，有一个长期运行的SemEval任务调用语义文本相似性 https://groups.google.com/forum/#!forum/sts-semeval

共同策略包括（并非详尽无遗）：

使用带有相似性分数的带注释语料库对句子对，提取一些特征，训练回归量并输出相似度得分
使用某种向量空间语义（强烈建议阅读：http://www.jair.org/media/2934/live-2934-4846-jair.pdf）然后做一些向量相似性得分（看看How to calculate cosine similarity given 2 sentence strings? - Python）

我。矢量空间语义行话的一个子集将派上用场（有时称为单词嵌入），有时人们使用主题模型/神经网络/深度学习（其他相关的流行语）训练向量空间，请参阅http://u.cs.biu.ac.il/~yogo/cvsc2015.pdf

II。您还可以使用更传统的词袋矢量并使用TF-IDF或任何其他“潜在”降维来压缩空间，然后使用一些矢量相似度函数来获得相似性

III。创建一个奇特的矢量相似度函数（例如cosmul，参见https://radimrehurek.com/gensim/models/word2vec.html），然后调整函数并在不同的空间进行评估。
使用包含概念本体的一些词汇资源（例如WordNet，Cyc等），然后通过遍历概念图来比较相似性（参见http://www.nltk.org/howto/wordnet.html）。一个例子是https://github.com/alvations/pywsd/blob/master/pywsd/similarity.py

鉴于以上为背景，没有注释，让我们试着破解一些向量空间的例子：

首先让我们尝试使用简单的二进制向量的简单ngram：

import numpy as np
from nltk import ngrams

doc1 = "Singularity is still a confusing phenomenon in physics".split()
doc2 = "Quantum theory still wins over String theory".split()
_vec1 = list(ngrams(doc1, 3))
_vec2 = list(ngrams(doc2, 3))
# Create a full dictionary of all possible ngrams.
vec_dict = list(set(_vec1).union(_vec2))
print 'Vector Dict:', vec_dict
# Now vectorize the documents
vec1 = [1 if ng in _vec1 else 0 for ng in vec_dict]
vec2 = [1 if ng in _vec2 else 0 for ng in vec_dict]
print 'Vectorzied:', vec1, vec2
print 'Similarity:', np.dot(vec1, vec2)

[OUT]：

Vector Dict: [('still', 'a', 'confusing'), ('confusing', 'phenomenon', 'in'), ('theory', 'still', 'wins'), ('is', 'still', 'a'), ('over', 'String', 'theory'), ('a', 'confusing', 'phenomenon'), ('wins', 'over', 'String'), ('Singularity', 'is', 'still'), ('still', 'wins', 'over'), ('phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still')] 

Vectorzied: [1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0] [0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1] 

Similarity: 0

现在让我们尝试从1gram到ngrams（其中n = len(sent)），并将所有内容放在带有二进制ngrams的向量字典中：

import numpy as np
from nltk import ngrams

def everygrams(sequence):
    """
    This function returns all possible ngrams for n 
    ranging from 1 to len(sequence).
    >>> list(everygrams('a b c'.split()))
    [('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c'), ('a', 'b', 'c')]
    """
    for n in range(1, len(sequence)+1):
        for ng in ngrams(sequence, n):
            yield ng

doc1 = "Singularity is still a confusing phenomenon in physics".split()
doc2 = "Quantum theory still wins over String theory".split()
_vec1 = list(everygrams(doc1))
_vec2 = list(everygrams(doc2))
# Create a full dictionary of all possible ngrams.
vec_dict = list(set(_vec1).union(_vec2))
print 'Vector Dict:', vec_dict, '\n'
# Now vectorize the documents
vec1 = [1 if ng in _vec1 else 0 for ng in vec_dict]
vec2 = [1 if ng in _vec2 else 0 for ng in vec_dict]
print 'Vectorzied:', vec1, vec2, '\n'
print 'Similarity:', np.dot(vec1, vec2), '\n'

[OUT]：

Vector Dict: [('still', 'a'), ('over', 'String'), ('theory', 'still', 'wins', 'over', 'String', 'theory'), ('String', 'theory'), ('physics',), ('in',), ('wins', 'over', 'String', 'theory'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('theory', 'still', 'wins'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon'), ('a',), ('wins',), ('is', 'still', 'a'), ('Singularity', 'is'), ('phenomenon', 'in'), ('still', 'wins', 'over', 'String'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over'), ('a', 'confusing', 'phenomenon'), ('Singularity', 'is', 'still', 'a'), ('confusing', 'phenomenon'), ('confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('wins', 'over'), ('theory', 'still', 'wins', 'over'), ('phenomenon',), ('Quantum', 'theory', 'still', 'wins', 'over', 'String'), ('is', 'still'), ('still', 'wins', 'over'), ('is', 'still', 'a', 'confusing', 'phenomenon'), ('phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins'), ('Quantum', 'theory', 'still'), ('a', 'confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still', 'a', 'confusing'), ('still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'a', 'confusing'), ('is', 'still', 'a', 'confusing'), ('in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over', 'String', 'theory'), ('confusing', 'phenomenon', 'in'), ('theory', 'still'), ('Quantum', 'theory'), ('is',), ('String',), ('over', 'String', 'theory'), ('still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('a', 'confusing'), ('still', 'wins'), ('still',), ('over',), ('still', 'a', 'confusing', 'phenomenon'), ('wins', 'over', 'String'), ('Singularity',), ('confusing',), ('theory',), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'wins', 'over', 'String', 'theory'), ('a', 'confusing', 'phenomenon', 'in'), ('Quantum',), ('theory', 'still', 'wins', 'over', 'String')] 

Vectorzied: [1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0] [0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1] 

Similarity: 1

现在让我们尝试正常化。可能的ngrams：

import numpy as np
from nltk import ngrams

def everygrams(sequence):
    """
    This function returns all possible ngrams for n 
    ranging from 1 to len(sequence).
    >>> list(everygrams('a b c'.split()))
    [('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c'), ('a', 'b', 'c')]
    """
    for n in range(1, len(sequence)+1):
        for ng in ngrams(sequence, n):
            yield ng

doc1 = "Singularity is still a confusing phenomenon in physics".split()
doc2 = "Quantum theory still wins over String theory".split()
_vec1 = list(everygrams(doc1))
_vec2 = list(everygrams(doc2))
# Create a full dictionary of all possible ngrams.
vec_dict = list(set(_vec1).union(_vec2))
print 'Vector Dict:', vec_dict, '\n'
# Now vectorize the documents
vec1 = [1/float(len(_vec1)) if ng in _vec1 else 0 for ng in vec_dict]
vec2 = [1/float(len(_vec2)) if ng in _vec2 else 0 for ng in vec_dict]
print 'Vectorzied:', vec1, vec2, '\n'
print 'Similarity:', np.dot(vec1, vec2), '\n'

它看起来更好，出来：

Vector Dict: [('still', 'a'), ('over', 'String'), ('theory', 'still', 'wins', 'over', 'String', 'theory'), ('String', 'theory'), ('physics',), ('in',), ('wins', 'over', 'String', 'theory'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('theory', 'still', 'wins'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon'), ('a',), ('wins',), ('is', 'still', 'a'), ('Singularity', 'is'), ('phenomenon', 'in'), ('still', 'wins', 'over', 'String'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over'), ('a', 'confusing', 'phenomenon'), ('Singularity', 'is', 'still', 'a'), ('confusing', 'phenomenon'), ('confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('wins', 'over'), ('theory', 'still', 'wins', 'over'), ('phenomenon',), ('Quantum', 'theory', 'still', 'wins', 'over', 'String'), ('is', 'still'), ('still', 'wins', 'over'), ('is', 'still', 'a', 'confusing', 'phenomenon'), ('phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins'), ('Quantum', 'theory', 'still'), ('a', 'confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still', 'a', 'confusing'), ('still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'a', 'confusing'), ('is', 'still', 'a', 'confusing'), ('in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over', 'String', 'theory'), ('confusing', 'phenomenon', 'in'), ('theory', 'still'), ('Quantum', 'theory'), ('is',), ('String',), ('over', 'String', 'theory'), ('still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('a', 'confusing'), ('still', 'wins'), ('still',), ('over',), ('still', 'a', 'confusing', 'phenomenon'), ('wins', 'over', 'String'), ('Singularity',), ('confusing',), ('theory',), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'wins', 'over', 'String', 'theory'), ('a', 'confusing', 'phenomenon', 'in'), ('Quantum',), ('theory', 'still', 'wins', 'over', 'String')] 

Vectorzied: [0.027777777777777776, 0, 0, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0, 0.027777777777777776, 0, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0] [0, 0.03571428571428571, 0.03571428571428571, 0.03571428571428571, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.03571428571428571, 0, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0, 0, 0, 0, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.03571428571428571, 0.03571428571428571, 0, 0, 0, 0, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571, 0, 0, 0.03571428571428571, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571] 

Similarity: 0.000992063492063

现在让我们尝试计算ngrams而不是1/len(_vec)，即_vec.count(ng) / len(_vec)：

import numpy as np
from nltk import ngrams

def everygrams(sequence):
    """
    This function returns all possible ngrams for n 
    ranging from 1 to len(sequence).
    >>> list(everygrams('a b c'.split()))
    [('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c'), ('a', 'b', 'c')]
    """
    for n in range(1, len(sequence)+1):
        for ng in ngrams(sequence, n):
            yield ng

doc1 = "Singularity is still a confusing phenomenon in physics".split()
doc2 = "Quantum theory still wins over String theory".split()
_vec1 = list(everygrams(doc1))
_vec2 = list(everygrams(doc2))
# Create a full dictionary of all possible ngrams.
vec_dict = list(set(_vec1).union(_vec2))
print 'Vector Dict:', vec_dict, '\n'
# Now vectorize the documents
vec1 = [_vec1.count(ng)/float(len(_vec1)) if ng in _vec1 else 0 for ng in vec_dict]
vec2 = [_vec2.count(ng)/float(len(_vec2)) if ng in _vec2 else 0 for ng in vec_dict]
print 'Vectorzied:', vec1, vec2, '\n'
print 'Similarity:', np.dot(vec1, vec2), '\n'

不出所料，由于计数均为1，因此相似度得分相同：

Vector Dict: [('still', 'a'), ('over', 'String'), ('theory', 'still', 'wins', 'over', 'String', 'theory'), ('String', 'theory'), ('physics',), ('in',), ('wins', 'over', 'String', 'theory'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('theory', 'still', 'wins'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon'), ('a',), ('wins',), ('is', 'still', 'a'), ('Singularity', 'is'), ('phenomenon', 'in'), ('still', 'wins', 'over', 'String'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over'), ('a', 'confusing', 'phenomenon'), ('Singularity', 'is', 'still', 'a'), ('confusing', 'phenomenon'), ('confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('wins', 'over'), ('theory', 'still', 'wins', 'over'), ('phenomenon',), ('Quantum', 'theory', 'still', 'wins', 'over', 'String'), ('is', 'still'), ('still', 'wins', 'over'), ('is', 'still', 'a', 'confusing', 'phenomenon'), ('phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins'), ('Quantum', 'theory', 'still'), ('a', 'confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still', 'a', 'confusing'), ('still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'a', 'confusing'), ('is', 'still', 'a', 'confusing'), ('in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over', 'String', 'theory'), ('confusing', 'phenomenon', 'in'), ('theory', 'still'), ('Quantum', 'theory'), ('is',), ('String',), ('over', 'String', 'theory'), ('still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('a', 'confusing'), ('still', 'wins'), ('still',), ('over',), ('still', 'a', 'confusing', 'phenomenon'), ('wins', 'over', 'String'), ('Singularity',), ('confusing',), ('theory',), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'wins', 'over', 'String', 'theory'), ('a', 'confusing', 'phenomenon', 'in'), ('Quantum',), ('theory', 'still', 'wins', 'over', 'String')] 

Vectorzied: [0.027777777777777776, 0, 0, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0, 0.027777777777777776, 0, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0] [0, 0.03571428571428571, 0.03571428571428571, 0.03571428571428571, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.03571428571428571, 0, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0, 0, 0, 0, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.03571428571428571, 0.03571428571428571, 0, 0, 0, 0, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571, 0, 0, 0.03571428571428571, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.07142857142857142, 0, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571] 

Similarity: 0.000992063492063

除了ngrams之外，您也可以尝试使用skipgrams：How to compute skipgrams in python?

使用权重标准化排名分数

1 个答案: