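I'm working on a document search problem: given a set of documents and a search query, I want to find the document closest to the query. The model I'm using is based on TfidfVectorizer from scikit-learn. I created 4 different tf-idf vectors for all the documents by using 4 different tokenizers. Each tokenizer splits the string into n-grams, where n ranges from 1 to 4.
For example:
doc_1 = "Singularity is still a confusing phenomenon in physics"
doc_2 = "Quantum theory still wins over String theory"
So model_1 uses the 1-gram tokenizer, model_2 uses the 2-gram tokenizer, and so on.
Next, for a given search query, I use these 4 models to compute the cosine similarity between the query and every document.
For example, take the search query "Singularity in quantum physics". The query is broken down into n-grams, and the tf-idf values are computed from the corresponding n-gram model.
So for each query-document pair I have 4 similarity values, one per n-gram model. For example:
1-gram similarity = 0.4370303325246957
2-gram similarity = 0.36617374546988996
3-gram similarity = 0.29519246156322099
4-gram similarity = 0.2902998188509896
All of these similarity scores are normalized to the range 0 to 1. Now I want to compute an aggregated normalized score such that, for any query-document pair, the higher n-gram similarity gets a very high weight. Basically, the higher the n-gram similarity, the greater its impact on the overall score.
Can someone suggest a solution?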
Answer 0 (score: 2)
There are many ways to combine these scores:
>>> onegram_sim = 0.43
>>> twogram_sim = 0.36
>>> threegram_sim = 0.29
>>> fourgram_sim = 0.29
# Sum(x) / len(list)
>>> all_sim = sum([onegram_sim, twogram_sim, threegram_sim, fourgram_sim]) / 4
>>> all_sim
0.3425
# Sum(x*x) / len(list)
>>> all_sim = sum(map(lambda x: x**2, [onegram_sim, twogram_sim, threegram_sim, fourgram_sim])) / 4
>>> all_sim
0.120675
# Product(x)
>>> from functools import reduce  # builtin on Python 2, must be imported on Python 3
>>> from operator import mul
>>> onetofour_sim = [onegram_sim, twogram_sim, threegram_sim, fourgram_sim]
>>> reduce(mul, onetofour_sim, 1)
0.013018679999999998
Ultimately, whichever of these gives you the better accuracy score is the best solution.
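None of the three combinations above gives the higher-order n-grams extra weight, which is what the question asks for. One possible way to do that is a weighted mean whose weights grow with n. The sketch below is only an illustration: the function name `weighted_similarity` and the `power` parameter are hypothetical, and it assumes that "higher n-gram similarity" means the similarity computed from higher-order n-grams.

```python
def weighted_similarity(sims, power=2.0):
    """Combine per-n-gram similarities, weighting order n by n**power.

    sims[0] is the 1-gram similarity, sims[1] the 2-gram similarity, etc.
    The weights are normalized by their sum, so the result stays in [0, 1]
    whenever every input is in [0, 1].
    """
    # Assumption: polynomial weights 1**power, 2**power, 3**power, ...
    weights = [(n + 1) ** power for n in range(len(sims))]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, sims)) / total


sims = [0.43, 0.36, 0.29, 0.29]
print(weighted_similarity(sims, power=0))  # all weights equal: the plain mean
print(weighted_similarity(sims, power=2))  # 4-gram weighted 16x the 1-gram
```

Raising `power` pushes more of the total weight onto the 3-gram and 4-gram scores. As with the other options, the right value is whatever scores best on your own evaluation data.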
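Going beyond your question:
For computing document similarity, there is a long-running SemEval task called Semantic Textual Similarity: https://groups.google.com/forum/#!forum/sts-semeval
Common strategies include (not exhaustive):
1. Use an annotated corpus of sentence pairs with similarity scores, extract some features, train a regressor, and output a similarity score.
2. Use some sort of vector space semantics (strongly recommended reading: http://www.jair.org/media/2934/live-2934-4846-jair.pdf) and then compute some vector similarity score (take a look at "How to calculate cosine similarity given 2 sentence strings? - Python").
   i. A subset of the vector space semantics jargon will come in handy (sometimes called word embeddings), and sometimes people train the vector space with topic models / neural nets / deep learning (other related buzzwords); see http://u.cs.biu.ac.il/~yogo/cvsc2015.pdf
   ii. You can also use a more traditional bag-of-words vector and compress the space with TF-IDF or any other "latent" dimensionality reduction, then use some vector similarity function to get the similarity.
   iii. Create a fancy vector similarity function (e.g. cosmul, see https://radimrehurek.com/gensim/models/word2vec.html), then tune the function and evaluate it on different spaces.
3. Use lexical resources that contain an ontology of concepts (e.g. WordNet, Cyc, etc.) and compare similarity by traversing the concept graphs (see http://www.nltk.org/howto/wordnet.html). An example is https://github.com/alvations/pywsd/blob/master/pywsd/similarity.py
With the above as background, and without any annotated data, let's try hacking some vector space examples.
First, let's try plain n-grams with simple binary vectors: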
import numpy as np
from nltk import ngrams

doc1 = "Singularity is still a confusing phenomenon in physics".split()
doc2 = "Quantum theory still wins over String theory".split()

# Extract the trigrams of each document.
_vec1 = list(ngrams(doc1, 3))
_vec2 = list(ngrams(doc2, 3))

# Create a full dictionary of all possible ngrams.
vec_dict = list(set(_vec1).union(_vec2))
print('Vector Dict:', vec_dict)

# Now vectorize the documents: 1 if the ngram occurs in the document, else 0.
vec1 = [1 if ng in _vec1 else 0 for ng in vec_dict]
vec2 = [1 if ng in _vec2 else 0 for ng in vec_dict]
print('Vectorzied:', vec1, vec2)
print('Similarity:', np.dot(vec1, vec2))
[OUT]:
Vector Dict: [('still', 'a', 'confusing'), ('confusing', 'phenomenon', 'in'), ('theory', 'still', 'wins'), ('is', 'still', 'a'), ('over', 'String', 'theory'), ('a', 'confusing', 'phenomenon'), ('wins', 'over', 'String'), ('Singularity', 'is', 'still'), ('still', 'wins', 'over'), ('phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still')]
Vectorzied: [1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0] [0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
Similarity: 0
Now let's try everything from 1-grams up to n-grams (where n = len(sent)) and put it all into the vector dictionary, still with binary n-gram features:
import numpy as np
from nltk import ngrams

def everygrams(sequence):
    """
    This function returns all possible ngrams for n
    ranging from 1 to len(sequence).
    >>> list(everygrams('a b c'.split()))
    [('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c'), ('a', 'b', 'c')]
    """
    for n in range(1, len(sequence)+1):
        for ng in ngrams(sequence, n):
            yield ng

doc1 = "Singularity is still a confusing phenomenon in physics".split()
doc2 = "Quantum theory still wins over String theory".split()
_vec1 = list(everygrams(doc1))
_vec2 = list(everygrams(doc2))

# Create a full dictionary of all possible ngrams.
vec_dict = list(set(_vec1).union(_vec2))
print('Vector Dict:', vec_dict, '\n')

# Now vectorize the documents: 1 if the ngram occurs in the document, else 0.
vec1 = [1 if ng in _vec1 else 0 for ng in vec_dict]
vec2 = [1 if ng in _vec2 else 0 for ng in vec_dict]
print('Vectorzied:', vec1, vec2, '\n')
print('Similarity:', np.dot(vec1, vec2), '\n')
[OUT]:
Vector Dict: [('still', 'a'), ('over', 'String'), ('theory', 'still', 'wins', 'over', 'String', 'theory'), ('String', 'theory'), ('physics',), ('in',), ('wins', 'over', 'String', 'theory'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('theory', 'still', 'wins'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon'), ('a',), ('wins',), ('is', 'still', 'a'), ('Singularity', 'is'), ('phenomenon', 'in'), ('still', 'wins', 'over', 'String'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over'), ('a', 'confusing', 'phenomenon'), ('Singularity', 'is', 'still', 'a'), ('confusing', 'phenomenon'), ('confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('wins', 'over'), ('theory', 'still', 'wins', 'over'), ('phenomenon',), ('Quantum', 'theory', 'still', 'wins', 'over', 'String'), ('is', 'still'), ('still', 'wins', 'over'), ('is', 'still', 'a', 'confusing', 'phenomenon'), ('phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins'), ('Quantum', 'theory', 'still'), ('a', 'confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still', 'a', 'confusing'), ('still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'a', 'confusing'), ('is', 'still', 'a', 'confusing'), ('in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over', 'String', 'theory'), ('confusing', 'phenomenon', 'in'), ('theory', 'still'), ('Quantum', 'theory'), ('is',), ('String',), ('over', 'String', 'theory'), ('still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('a', 'confusing'), ('still', 'wins'), ('still',), ('over',), ('still', 'a', 'confusing', 'phenomenon'), ('wins', 'over', 'String'), ('Singularity',), ('confusing',), ('theory',), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'wins', 'over', 'String', 'theory'), ('a', 'confusing', 'phenomenon', 'in'), 
('Quantum',), ('theory', 'still', 'wins', 'over', 'String')]
Vectorzied: [1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0] [0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1]
Similarity: 1
Now let's try normalizing by the number of possible n-grams:
import numpy as np
from nltk import ngrams

def everygrams(sequence):
    """
    This function returns all possible ngrams for n
    ranging from 1 to len(sequence).
    >>> list(everygrams('a b c'.split()))
    [('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c'), ('a', 'b', 'c')]
    """
    for n in range(1, len(sequence)+1):
        for ng in ngrams(sequence, n):
            yield ng

doc1 = "Singularity is still a confusing phenomenon in physics".split()
doc2 = "Quantum theory still wins over String theory".split()
_vec1 = list(everygrams(doc1))
_vec2 = list(everygrams(doc2))

# Create a full dictionary of all possible ngrams.
vec_dict = list(set(_vec1).union(_vec2))
print('Vector Dict:', vec_dict, '\n')

# Now vectorize the documents: each present ngram gets weight 1/len(_vec).
vec1 = [1/float(len(_vec1)) if ng in _vec1 else 0 for ng in vec_dict]
vec2 = [1/float(len(_vec2)) if ng in _vec2 else 0 for ng in vec_dict]
print('Vectorzied:', vec1, vec2, '\n')
print('Similarity:', np.dot(vec1, vec2), '\n')
That looks better:
[OUT]:
Vector Dict: [('still', 'a'), ('over', 'String'), ('theory', 'still', 'wins', 'over', 'String', 'theory'), ('String', 'theory'), ('physics',), ('in',), ('wins', 'over', 'String', 'theory'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('theory', 'still', 'wins'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon'), ('a',), ('wins',), ('is', 'still', 'a'), ('Singularity', 'is'), ('phenomenon', 'in'), ('still', 'wins', 'over', 'String'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over'), ('a', 'confusing', 'phenomenon'), ('Singularity', 'is', 'still', 'a'), ('confusing', 'phenomenon'), ('confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('wins', 'over'), ('theory', 'still', 'wins', 'over'), ('phenomenon',), ('Quantum', 'theory', 'still', 'wins', 'over', 'String'), ('is', 'still'), ('still', 'wins', 'over'), ('is', 'still', 'a', 'confusing', 'phenomenon'), ('phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins'), ('Quantum', 'theory', 'still'), ('a', 'confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still', 'a', 'confusing'), ('still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'a', 'confusing'), ('is', 'still', 'a', 'confusing'), ('in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over', 'String', 'theory'), ('confusing', 'phenomenon', 'in'), ('theory', 'still'), ('Quantum', 'theory'), ('is',), ('String',), ('over', 'String', 'theory'), ('still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('a', 'confusing'), ('still', 'wins'), ('still',), ('over',), ('still', 'a', 'confusing', 'phenomenon'), ('wins', 'over', 'String'), ('Singularity',), ('confusing',), ('theory',), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'wins', 'over', 'String', 'theory'), ('a', 'confusing', 'phenomenon', 'in'), 
('Quantum',), ('theory', 'still', 'wins', 'over', 'String')]
Vectorzied: [0.027777777777777776, 0, 0, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0, 0.027777777777777776, 0, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0] [0, 0.03571428571428571, 0.03571428571428571, 0.03571428571428571, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.03571428571428571, 0, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0, 0, 0, 0, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.03571428571428571, 0.03571428571428571, 0, 0, 0, 0, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571, 0, 0, 0.03571428571428571, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571]
Similarity: 0.000992063492063
Now let's try counting the n-grams instead of using 1/len(_vec), i.e. _vec.count(ng) / len(_vec):
import numpy as np
from nltk import ngrams

def everygrams(sequence):
    """
    This function returns all possible ngrams for n
    ranging from 1 to len(sequence).
    >>> list(everygrams('a b c'.split()))
    [('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c'), ('a', 'b', 'c')]
    """
    for n in range(1, len(sequence)+1):
        for ng in ngrams(sequence, n):
            yield ng

doc1 = "Singularity is still a confusing phenomenon in physics".split()
doc2 = "Quantum theory still wins over String theory".split()
_vec1 = list(everygrams(doc1))
_vec2 = list(everygrams(doc2))

# Create a full dictionary of all possible ngrams.
vec_dict = list(set(_vec1).union(_vec2))
print('Vector Dict:', vec_dict, '\n')

# Now vectorize the documents: relative frequency of each ngram in the document.
vec1 = [_vec1.count(ng)/float(len(_vec1)) if ng in _vec1 else 0 for ng in vec_dict]
vec2 = [_vec2.count(ng)/float(len(_vec2)) if ng in _vec2 else 0 for ng in vec_dict]
print('Vectorzied:', vec1, vec2, '\n')
print('Similarity:', np.dot(vec1, vec2), '\n')
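Unsurprisingly, the similarity score is the same. The only n-gram with a count above 1 is ('theory',) in doc2, and it does not occur in doc1, so it adds nothing to the dot product: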
Vector Dict: [('still', 'a'), ('over', 'String'), ('theory', 'still', 'wins', 'over', 'String', 'theory'), ('String', 'theory'), ('physics',), ('in',), ('wins', 'over', 'String', 'theory'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('theory', 'still', 'wins'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon'), ('a',), ('wins',), ('is', 'still', 'a'), ('Singularity', 'is'), ('phenomenon', 'in'), ('still', 'wins', 'over', 'String'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over'), ('a', 'confusing', 'phenomenon'), ('Singularity', 'is', 'still', 'a'), ('confusing', 'phenomenon'), ('confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('wins', 'over'), ('theory', 'still', 'wins', 'over'), ('phenomenon',), ('Quantum', 'theory', 'still', 'wins', 'over', 'String'), ('is', 'still'), ('still', 'wins', 'over'), ('is', 'still', 'a', 'confusing', 'phenomenon'), ('phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins'), ('Quantum', 'theory', 'still'), ('a', 'confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still', 'a', 'confusing'), ('still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'a', 'confusing'), ('is', 'still', 'a', 'confusing'), ('in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over', 'String', 'theory'), ('confusing', 'phenomenon', 'in'), ('theory', 'still'), ('Quantum', 'theory'), ('is',), ('String',), ('over', 'String', 'theory'), ('still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('a', 'confusing'), ('still', 'wins'), ('still',), ('over',), ('still', 'a', 'confusing', 'phenomenon'), ('wins', 'over', 'String'), ('Singularity',), ('confusing',), ('theory',), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'wins', 'over', 'String', 'theory'), ('a', 'confusing', 'phenomenon', 'in'), 
('Quantum',), ('theory', 'still', 'wins', 'over', 'String')]
Vectorzied: [0.027777777777777776, 0, 0, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0, 0.027777777777777776, 0, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0] [0, 0.03571428571428571, 0.03571428571428571, 0.03571428571428571, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.03571428571428571, 0, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0, 0, 0, 0, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.03571428571428571, 0.03571428571428571, 0, 0, 0, 0, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571, 0, 0, 0.03571428571428571, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.07142857142857142, 0, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571]
Similarity: 0.000992063492063
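The dot products above shrink as the documents get longer, because each vector is divided by the number of n-grams in its document. Cosine similarity, mentioned in the strategies earlier, divides the dot product by the lengths of both vectors instead, so the result always falls between 0 and 1 for count vectors. A minimal sketch in pure Python follows. It re-implements `everygrams` without NLTK so that it is self-contained, and the `cosine_similarity` helper is written for this illustration rather than taken from any library.

```python
from collections import Counter
from math import sqrt


def everygrams(tokens):
    """Yield every n-gram of tokens as a tuple, for n from 1 to len(tokens)."""
    for n in range(1, len(tokens) + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the two n-gram count vectors."""
    counts_a = Counter(everygrams(tokens_a))
    counts_b = Counter(everygrams(tokens_b))
    # Only n-grams present in both documents contribute to the dot product.
    dot = sum(counts_a[ng] * counts_b[ng] for ng in counts_a if ng in counts_b)
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc1 = "Singularity is still a confusing phenomenon in physics".split()
doc2 = "Quantum theory still wins over String theory".split()
print(cosine_similarity(doc1, doc2))
```

Using a `Counter` also avoids the repeated `list.count` calls and the explicit `vec_dict`, since n-grams that are missing from one document contribute nothing to the dot product.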
Besides n-grams, you can also try skipgrams; see "How to compute skipgrams in python?"
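For reference, a k-skip-n-gram is an n-gram that is allowed to skip up to k tokens in total. The sketch below is one possible pure-Python way to generate them. It is an illustrative helper written for this answer, not the code from the linked question.

```python
from itertools import combinations


def skipgrams(tokens, n, k):
    """Yield n-grams of tokens that may skip up to k tokens in total."""
    for i in range(len(tokens) - n + 1):
        # The remaining n-1 tokens are picked, in order, from the next n-1+k tokens.
        window = tokens[i + 1:i + n + k]
        for rest in combinations(window, n - 1):
            yield (tokens[i],) + tuple(rest)


doc2 = "Quantum theory still wins over String theory".split()
for sg in skipgrams(doc2, 2, 1):
    print(sg)
```

With k = 0 this reduces to ordinary n-grams, so the resulting tuples can be dropped into the same vector dictionary as in the examples above.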