Normalizing ranking scores using weights

Date: 2015-08-11 00:41:20

Tags: python nlp nltk normalize cosine-similarity

I am working on a document search problem: given a set of documents and a search query, I want to find the document closest to the query. The model I am using is based on scikit-learn's TfidfVectorizer. I created 4 different tf_idf vectors for all the documents by using 4 different types of tokenizers. Each tokenizer splits a string into n-grams, where n is in the range 1...4.

For example:

doc_1 = "Singularity is still a confusing phenomenon in physics"
doc_2 = "Quantum theory still wins over String theory"

So model_1 will use a 1-gram tokenizer, model_2 will use a 2-gram tokenizer, and so on.

Next, for a given search query, I compute the cosine similarity between the query and all the other documents using these 4 models.

For example, for the search query "Singularity in quantum physics", the query is broken into n-grams and its tf_idf values are computed from the corresponding n-gram model.

So for every query-document pair I have 4 similarity values, one per n-gram model. For example:

1-gram similarity = 0.4370303325246957
2-gram similarity = 0.36617374546988996
3-gram similarity = 0.29519246156322099
4-gram similarity = 0.2902998188509896

All of these similarity scores are normalized to the range 0 to 1. Now I want to compute an aggregated, normalized score such that, for any query-document pair, a higher n-gram similarity gets a much higher weight. Basically, the higher an n-gram similarity, the more it should influence the overall score.

Can anyone suggest a solution?

1 answer:

Answer 0 (score: 2):

There are many ways to solve this kind of problem:

>>> onegram_sim = 0.43
>>> twogram_sim = 0.36
>>> threegram_sim = 0.29
>>> fourgram_sim = 0.29
# Mean: sum(x) / len(list)
>>> all_sim = sum([onegram_sim, twogram_sim, threegram_sim, fourgram_sim]) / 4
>>> all_sim
0.3425
# Mean of squares: sum(x*x) / len(list)
>>> all_sim = sum(map(lambda x: x**2, [onegram_sim, twogram_sim, threegram_sim, fourgram_sim])) / 4
>>> all_sim
0.120675
# Product(x)
>>> from operator import mul
>>> from functools import reduce  # reduce is not a builtin in Python 3
>>> onetofour_sim = [onegram_sim, twogram_sim, threegram_sim, fourgram_sim]
>>> reduce(mul, onetofour_sim, 1)
0.013018679999999998
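
None of these actually gives larger similarity values more weight, though. As a minimal sketch of one way to do that, assume "higher similarity should count more" means each score is weighted by its own magnitude, i.e. a self-weighted mean (the choice of weights here is an assumption, not a standard recipe):

sims = [0.437, 0.366, 0.295, 0.290]  # 1-gram ... 4-gram similarities
# Self-weighted mean: each score contributes in proportion to its own
# magnitude, so the larger similarities dominate the aggregate. Because
# this is a convex combination of the scores, the result stays in [0, 1].
weighted_sim = sum(s * s for s in sims) / sum(sims)
print(weighted_sim)  # ~0.357, pulled toward the larger scores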

Ultimately, whatever gets you a better accuracy score is the best solution.

Going beyond your question:

To compute document similarity, there is a long-running SemEval shared task called Semantic Textual Similarity: https://groups.google.com/forum/#!forum/sts-semeval

Common strategies include (non-exhaustively):

  1. Use an annotated corpus of sentence pairs with similarity scores, extract some features, train a regressor, and have it output a similarity score

  2. Use some form of vector space semantics (strongly recommended reading: http://www.jair.org/media/2934/live-2934-4846-jair.pdf) and then compute some vector similarity score (take a look at How to calculate cosine similarity given 2 sentence strings? - Python)

    i. A subset of the vector space semantics jargon will come in handy (sometimes known as word embeddings); sometimes people train the vector space with topic models / neural nets / deep learning (other related buzzwords), see http://u.cs.biu.ac.il/~yogo/cvsc2015.pdf

    ii. You could also use the more traditional bag-of-words vectors and compress the space with TF-IDF or any other "latent" dimensionality reduction, then use some vector similarity function to get the similarity

    iii. Create a fancy vector similarity function (e.g. cosmul, see https://radimrehurek.com/gensim/models/word2vec.html), then tune the function and evaluate it on different spaces.

  3. Use lexical resources that come with an ontology of concepts (e.g. WordNet, Cyc, etc.) and compare similarity by traversing the concept graphs (see http://www.nltk.org/howto/wordnet.html). An example is https://github.com/alvations/pywsd/blob/master/pywsd/similarity.py

  4. Given the above as background, and without annotations, let's try hacking out some vector space examples:

    First, let's try simple ngrams with simple binary vectors:

    import numpy as np
    from nltk import ngrams
    
    doc1 = "Singularity is still a confusing phenomenon in physics".split()
    doc2 = "Quantum theory still wins over String theory".split()
    _vec1 = list(ngrams(doc1, 3))
    _vec2 = list(ngrams(doc2, 3))
    # Create a full dictionary of all possible ngrams.
    vec_dict = list(set(_vec1).union(_vec2))
    print('Vector Dict:', vec_dict)
    # Now vectorize the documents as binary (0/1) vectors.
    vec1 = [1 if ng in _vec1 else 0 for ng in vec_dict]
    vec2 = [1 if ng in _vec2 else 0 for ng in vec_dict]
    print('Vectorized:', vec1, vec2)
    print('Similarity:', np.dot(vec1, vec2))
    

    [OUT]:

    Vector Dict: [('still', 'a', 'confusing'), ('confusing', 'phenomenon', 'in'), ('theory', 'still', 'wins'), ('is', 'still', 'a'), ('over', 'String', 'theory'), ('a', 'confusing', 'phenomenon'), ('wins', 'over', 'String'), ('Singularity', 'is', 'still'), ('still', 'wins', 'over'), ('phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still')] 
    
    Vectorized: [1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0] [0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1] 
    
    Similarity: 0 
    

    The two documents share no trigram, hence the similarity of 0. Now let's try everything from 1grams up to ngrams (where n = len(sent)) and put it all into the vector dictionary, still with binary values:

    import numpy as np
    from nltk import ngrams
    
    def everygrams(sequence):
        """
        This function returns all possible ngrams for n 
        ranging from 1 to len(sequence).
        >>> list(everygrams('a b c'.split()))
        [('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c'), ('a', 'b', 'c')]
        """
        for n in range(1, len(sequence)+1):
            for ng in ngrams(sequence, n):
                yield ng
    
    doc1 = "Singularity is still a confusing phenomenon in physics".split()
    doc2 = "Quantum theory still wins over String theory".split()
    _vec1 = list(everygrams(doc1))
    _vec2 = list(everygrams(doc2))
    # Create a full dictionary of all possible ngrams.
    vec_dict = list(set(_vec1).union(_vec2))
    print('Vector Dict:', vec_dict, '\n')
    # Now vectorize the documents as binary (0/1) vectors.
    vec1 = [1 if ng in _vec1 else 0 for ng in vec_dict]
    vec2 = [1 if ng in _vec2 else 0 for ng in vec_dict]
    print('Vectorized:', vec1, vec2, '\n')
    print('Similarity:', np.dot(vec1, vec2), '\n')
    

    [OUT]:

    Vector Dict: [('still', 'a'), ('over', 'String'), ('theory', 'still', 'wins', 'over', 'String', 'theory'), ('String', 'theory'), ('physics',), ('in',), ('wins', 'over', 'String', 'theory'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('theory', 'still', 'wins'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon'), ('a',), ('wins',), ('is', 'still', 'a'), ('Singularity', 'is'), ('phenomenon', 'in'), ('still', 'wins', 'over', 'String'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over'), ('a', 'confusing', 'phenomenon'), ('Singularity', 'is', 'still', 'a'), ('confusing', 'phenomenon'), ('confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('wins', 'over'), ('theory', 'still', 'wins', 'over'), ('phenomenon',), ('Quantum', 'theory', 'still', 'wins', 'over', 'String'), ('is', 'still'), ('still', 'wins', 'over'), ('is', 'still', 'a', 'confusing', 'phenomenon'), ('phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins'), ('Quantum', 'theory', 'still'), ('a', 'confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still', 'a', 'confusing'), ('still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'a', 'confusing'), ('is', 'still', 'a', 'confusing'), ('in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over', 'String', 'theory'), ('confusing', 'phenomenon', 'in'), ('theory', 'still'), ('Quantum', 'theory'), ('is',), ('String',), ('over', 'String', 'theory'), ('still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('a', 'confusing'), ('still', 'wins'), ('still',), ('over',), ('still', 'a', 'confusing', 'phenomenon'), ('wins', 'over', 'String'), ('Singularity',), ('confusing',), ('theory',), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'wins', 'over', 'String', 'theory'), ('a', 'confusing', 'phenomenon', 'in'), ('Quantum',), ('theory', 'still', 'wins', 'over', 'String')] 
    
    Vectorized: [1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0] [0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1] 
    
    Similarity: 1 
    

    The similarity jumps to 1 because the documents share the unigram ('still',). Now let's try normalizing by the number of possible ngrams:

    import numpy as np
    from nltk import ngrams
    
    def everygrams(sequence):
        """
        This function returns all possible ngrams for n 
        ranging from 1 to len(sequence).
        >>> list(everygrams('a b c'.split()))
        [('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c'), ('a', 'b', 'c')]
        """
        for n in range(1, len(sequence)+1):
            for ng in ngrams(sequence, n):
                yield ng
    
    doc1 = "Singularity is still a confusing phenomenon in physics".split()
    doc2 = "Quantum theory still wins over String theory".split()
    _vec1 = list(everygrams(doc1))
    _vec2 = list(everygrams(doc2))
    # Create a full dictionary of all possible ngrams.
    vec_dict = list(set(_vec1).union(_vec2))
    print('Vector Dict:', vec_dict, '\n')
    # Now vectorize the documents, weighting each present ngram by
    # 1 / (number of ngrams in the document).
    vec1 = [1/float(len(_vec1)) if ng in _vec1 else 0 for ng in vec_dict]
    vec2 = [1/float(len(_vec2)) if ng in _vec2 else 0 for ng in vec_dict]
    print('Vectorized:', vec1, vec2, '\n')
    print('Similarity:', np.dot(vec1, vec2), '\n')
    

    That looks better. The output:

    Vector Dict: [('still', 'a'), ('over', 'String'), ('theory', 'still', 'wins', 'over', 'String', 'theory'), ('String', 'theory'), ('physics',), ('in',), ('wins', 'over', 'String', 'theory'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('theory', 'still', 'wins'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon'), ('a',), ('wins',), ('is', 'still', 'a'), ('Singularity', 'is'), ('phenomenon', 'in'), ('still', 'wins', 'over', 'String'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over'), ('a', 'confusing', 'phenomenon'), ('Singularity', 'is', 'still', 'a'), ('confusing', 'phenomenon'), ('confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('wins', 'over'), ('theory', 'still', 'wins', 'over'), ('phenomenon',), ('Quantum', 'theory', 'still', 'wins', 'over', 'String'), ('is', 'still'), ('still', 'wins', 'over'), ('is', 'still', 'a', 'confusing', 'phenomenon'), ('phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins'), ('Quantum', 'theory', 'still'), ('a', 'confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still', 'a', 'confusing'), ('still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'a', 'confusing'), ('is', 'still', 'a', 'confusing'), ('in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over', 'String', 'theory'), ('confusing', 'phenomenon', 'in'), ('theory', 'still'), ('Quantum', 'theory'), ('is',), ('String',), ('over', 'String', 'theory'), ('still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('a', 'confusing'), ('still', 'wins'), ('still',), ('over',), ('still', 'a', 'confusing', 'phenomenon'), ('wins', 'over', 'String'), ('Singularity',), ('confusing',), ('theory',), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'wins', 'over', 'String', 'theory'), ('a', 'confusing', 'phenomenon', 'in'), ('Quantum',), ('theory', 'still', 'wins', 'over', 'String')] 
    
    Vectorized: [0.027777777777777776, 0, 0, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0, 0.027777777777777776, 0, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0] [0, 0.03571428571428571, 0.03571428571428571, 0.03571428571428571, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.03571428571428571, 0, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0, 0, 0, 0, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.03571428571428571, 0.03571428571428571, 0, 0, 0, 0, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571, 0, 0, 0.03571428571428571, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571] 
    
    Similarity: 0.000992063492063 
    

    Now let's try counting the ngrams instead of using 1/len(_vec), i.e. _vec.count(ng) / float(len(_vec)):

    import numpy as np
    from nltk import ngrams
    
    def everygrams(sequence):
        """
        This function returns all possible ngrams for n 
        ranging from 1 to len(sequence).
        >>> list(everygrams('a b c'.split()))
        [('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c'), ('a', 'b', 'c')]
        """
        for n in range(1, len(sequence)+1):
            for ng in ngrams(sequence, n):
                yield ng
    
    doc1 = "Singularity is still a confusing phenomenon in physics".split()
    doc2 = "Quantum theory still wins over String theory".split()
    _vec1 = list(everygrams(doc1))
    _vec2 = list(everygrams(doc2))
    # Create a full dictionary of all possible ngrams.
    vec_dict = list(set(_vec1).union(_vec2))
    print('Vector Dict:', vec_dict, '\n')
    # Now vectorize the documents, weighting each ngram by its relative
    # frequency within the document.
    vec1 = [_vec1.count(ng)/float(len(_vec1)) if ng in _vec1 else 0 for ng in vec_dict]
    vec2 = [_vec2.count(ng)/float(len(_vec2)) if ng in _vec2 else 0 for ng in vec_dict]
    print('Vectorized:', vec1, vec2, '\n')
    print('Similarity:', np.dot(vec1, vec2), '\n')
    

    Unsurprisingly, the similarity score is the same, since the only ngram the documents share, ('still',), occurs exactly once in each:

    Vector Dict: [('still', 'a'), ('over', 'String'), ('theory', 'still', 'wins', 'over', 'String', 'theory'), ('String', 'theory'), ('physics',), ('in',), ('wins', 'over', 'String', 'theory'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('theory', 'still', 'wins'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon'), ('a',), ('wins',), ('is', 'still', 'a'), ('Singularity', 'is'), ('phenomenon', 'in'), ('still', 'wins', 'over', 'String'), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over'), ('a', 'confusing', 'phenomenon'), ('Singularity', 'is', 'still', 'a'), ('confusing', 'phenomenon'), ('confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still'), ('is', 'still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('wins', 'over'), ('theory', 'still', 'wins', 'over'), ('phenomenon',), ('Quantum', 'theory', 'still', 'wins', 'over', 'String'), ('is', 'still'), ('still', 'wins', 'over'), ('is', 'still', 'a', 'confusing', 'phenomenon'), ('phenomenon', 'in', 'physics'), ('Quantum', 'theory', 'still', 'wins'), ('Quantum', 'theory', 'still'), ('a', 'confusing', 'phenomenon', 'in', 'physics'), ('Singularity', 'is', 'still', 'a', 'confusing'), ('still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'a', 'confusing'), ('is', 'still', 'a', 'confusing'), ('in', 'physics'), ('Quantum', 'theory', 'still', 'wins', 'over', 'String', 'theory'), ('confusing', 'phenomenon', 'in'), ('theory', 'still'), ('Quantum', 'theory'), ('is',), ('String',), ('over', 'String', 'theory'), ('still', 'a', 'confusing', 'phenomenon', 'in', 'physics'), ('a', 'confusing'), ('still', 'wins'), ('still',), ('over',), ('still', 'a', 'confusing', 'phenomenon'), ('wins', 'over', 'String'), ('Singularity',), ('confusing',), ('theory',), ('Singularity', 'is', 'still', 'a', 'confusing', 'phenomenon', 'in'), ('still', 'wins', 'over', 'String', 'theory'), ('a', 'confusing', 'phenomenon', 'in'), ('Quantum',), ('theory', 'still', 'wins', 'over', 'String')] 
    
    Vectorized: [0.027777777777777776, 0, 0, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0, 0.027777777777777776, 0, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0.027777777777777776, 0, 0.027777777777777776, 0, 0.027777777777777776, 0, 0] [0, 0.03571428571428571, 0.03571428571428571, 0.03571428571428571, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.03571428571428571, 0, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0, 0, 0, 0, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.03571428571428571, 0.03571428571428571, 0, 0, 0, 0, 0, 0, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571, 0, 0, 0.03571428571428571, 0.03571428571428571, 0.03571428571428571, 0, 0.03571428571428571, 0, 0, 0.07142857142857142, 0, 0.03571428571428571, 0, 0.03571428571428571, 0.03571428571428571] 
    
    Similarity: 0.000992063492063 
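
    Note that all the scores above are raw dot products rather than true cosine similarities. A minimal sketch of the missing normalization step, assuming the vec1, vec2 and np built in the snippet above; dividing by the vector norms bounds the score to [0, 1] regardless of document length:

    # Cosine similarity: dot product divided by the product of the norms.
    # Assumes vec1, vec2 and numpy (np) from the preceding snippet.
    cosine = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
    print('Cosine similarity:', cosine)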
    

    Beyond ngrams, you can also try skipgrams: How to compute skipgrams in python?
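
    For instance, here is a minimal sketch using NLTK's skipgrams helper (assuming a recent NLTK release that ships nltk.util.skipgrams; skipgrams(sequence, n, k) yields n-grams that may skip up to k intervening tokens):

    from nltk.util import skipgrams

    doc1 = "Singularity is still a confusing phenomenon in physics".split()
    # Bigrams allowing up to 1 skipped token, e.g. ('Singularity', 'still').
    print(list(skipgrams(doc1, 2, 1)))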