Question

I am trying to use Lucene to compute the similarity between a number of documents. For the similarity computation I use BM25 and VSM.

Besides Lucene, I am using GATE, an open-source framework for language processing tasks.

When I tried to compute the similarity between my files (15 of them), I ran into some strange behavior.

With VSM, my results are as follows:

Post-processing links before ranking
Ranking all links by similarities
3/54 links above similarity 0.15 threshold
54/54 top-most 1.0 similar links
Post-processing links after ranking
Traced 3 link(s) in 9x6 space:
Link = [12695.xml(0,58320)@Bug[15009] | 12713.xml(0,18247)@Feature[1974]]@[1.6188]
Link = [5822.xml(0,10098)@Bug[1434] | 12713.xml(0,18247)@Feature[1974]]@[1.5119]
Link = [12694.xml(0,1504)@Bug[188] | 12713.xml(0,18247)@Feature[1974]]@[0.2702]
Clearing previous runtime results...

Score breakdown:
6.860396E-7 = (MATCH) max of:
  0.0 = (MATCH) MatchAllDocsQuery, product of:
    0.0 = boost
    0.0032560423 = queryNorm
  6.860396E-7 = (MATCH) product of:
    0.0034322562 = (MATCH) sum of:
      0.0017054792 = (MATCH) weight(TERM:http in 1) [DefaultSimilarity], result of:
        0.0017054792 = score(doc=1,freq=2.0), product of:
          0.0045762537 = queryWeight, product of:
            1.4054651 = idf(docFreq=3, maxDocs=6)
            0.0032560423 = queryNorm
          0.37268022 = fieldWeight in 1, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            1.4054651 = idf(docFreq=3, maxDocs=6)
            0.1875 = fieldNorm(doc=1)
      8.6338853E-4 = (MATCH) weight(TERM:use in 1) [DefaultSimilarity], result of:
        8.6338853E-4 = score(doc=1,freq=2.0), product of:
          0.0032560423 = queryWeight, product of:
            1.0 = idf(docFreq=5, maxDocs=6)
            0.0032560423 = queryNorm
          0.26516503 = fieldWeight in 1, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            1.0 = idf(docFreq=5, maxDocs=6)
            0.1875 = fieldNorm(doc=1)
      8.6338853E-4 = (MATCH) weight(TERM:use in 1) [DefaultSimilarity], result of:
        8.6338853E-4 = score(doc=1,freq=2.0), product of:
          0.0032560423 = queryWeight, product of:
            1.0 = idf(docFreq=5, maxDocs=6)
            0.0032560423 = queryNorm
          0.26516503 = fieldWeight in 1, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            1.0 = idf(docFreq=5, maxDocs=6)
            0.1875 = fieldNorm(doc=1)
    1.9988007E-4 = coord(3/15009)
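To make the VSM breakdown above easier to follow, here is a minimal sketch that recomputes it by hand. It assumes the classic Lucene TF-IDF formulas (idf = 1 + ln(maxDocs / (docFreq + 1)), tf = sqrt(freq)); queryNorm, fieldNorm and the coord fraction are simply copied from the explain output. The class and method names are made up for illustration and are not part of Lucene.

```java
// Recomputes the DefaultSimilarity numbers from the explain output by hand.
// All constants are copied from that output; class and method names are hypothetical.
final class VsmScoreCheck {

    static final double QUERY_NORM = 0.0032560423; // queryNorm from the explain output
    static final double FIELD_NORM = 0.1875;       // fieldNorm(doc=1) from the explain output

    // Classic TF-IDF idf: 1 + ln(maxDocs / (docFreq + 1))
    static double idf(long docFreq, long maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // Classic TF-IDF tf: sqrt(freq)
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    // One term's contribution: queryWeight * fieldWeight
    static double termScore(double freq, long docFreq, long maxDocs,
                            double queryNorm, double fieldNorm) {
        double idf = idf(docFreq, maxDocs);
        double queryWeight = idf * queryNorm;
        double fieldWeight = tf(freq) * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    // Sum of the three term clauses, scaled by coord(3/15009)
    static double finalScore() {
        double http = termScore(2.0, 3, 6, QUERY_NORM, FIELD_NORM);
        double use = termScore(2.0, 5, 6, QUERY_NORM, FIELD_NORM);
        double sum = http + use + use; // "use" is listed twice in the breakdown
        double coord = 3.0 / 15009.0;
        return sum * coord;
    }

    public static void main(String[] args) {
        System.out.printf("http term = %.10f%n",
                termScore(2.0, 3, 6, QUERY_NORM, FIELD_NORM));
        System.out.printf("use term  = %.10f%n",
                termScore(2.0, 5, 6, QUERY_NORM, FIELD_NORM));
        System.out.printf("final     = %.6e%n", finalScore());
    }
}
```

Up to float rounding, the results should agree with the values in the breakdown. Two factors shrink the VSM score here: the small queryNorm inside every queryWeight and the coord(3/15009) multiplier applied at the end.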

With BM25 I get some strange behavior:

Post-processing links before ranking
Ranking all links by similarities
40/54 links above similarity 0.15 threshold
54/54 top-most 1.0 similar links
Post-processing links after ranking
Traced 40 link(s) in 9x6 space:
Link = [12695.xml(0,58320)@Bug[15009] | 12713.xml(0,18247)@Feature[1974]]@[10768.2471]
Link = [5822.xml(0,10098)@Bug[1434] | 12713.xml(0,18247)@Feature[1974]]@[1798.1300]
Link = [12695.xml(0,58320)@Bug[15009] | 13091.xml(0,1721)@Feature[216]]@[965.0315]
Link = [5822.xml(0,10098)@Bug[1434] | 13091.xml(0,1721)@Feature[216]]@[372.0819]
Link = [12694.xml(0,1504)@Bug[188] | 12713.xml(0,18247)@Feature[1974]]@[174.2649]
Link = [12695.xml(0,58320)@Bug[15009] | 12700.xml(0,410)@Feature[36]]@[97.6378]
Link = [5822.xml(0,10098)@Bug[1434] | 1910.xml(0,237)@Feature[21]]@[46.4066]
Link = [12694.xml(0,1504)@Bug[188] | 13091.xml(0,1721)@Feature[216]]@[35.8532]
Link = [5822.xml(0,10098)@Bug[1434] | 12701.xml(0,137)@Feature[14]]@[29.6364]
Link = [12698.xml(0,362)@Bug[56] | 12713.xml(0,18247)@Feature[1974]]@[22.4652]
Link = [132.xml(0,409)@Bug[33] | 12713.xml(0,18247)@Feature[1974]]@[21.1697]
Link = [5822.xml(0,10098)@Bug[1434] | 12700.xml(0,410)@Feature[36]]@[16.7317]
Link = [132.xml(0,409)@Bug[33] | 13091.xml(0,1721)@Feature[216]]@[15.8749]
Link = [12697.xml(0,257)@Bug[34] | 12713.xml(0,18247)@Feature[1974]]@[15.5943]
Link = [12696.xml(0,272)@Bug[40] | 12713.xml(0,18247)@Feature[1974]]@[14.8670]
Link = [5822.xml(0,10098)@Bug[1434] | 12702.xml(0,88)@Feature[9]]@[14.8045]
Link = [12694.xml(0,1504)@Bug[188] | 1910.xml(0,237)@Feature[21]]@[13.8415]
Link = [12694.xml(0,1504)@Bug[188] | 12700.xml(0,410)@Feature[36]]@[11.7942]
Link = [12703.xml(0,331)@Bug[43] | 12713.xml(0,18247)@Feature[1974]]@[11.2949]
Link = [12699.xml(0,616)@Bug[67] | 12713.xml(0,18247)@Feature[1974]]@[9.4193]
Link = [12695.xml(0,58320)@Bug[15009] | 12701.xml(0,137)@Feature[14]]@[8.6146]
Link = [12699.xml(0,616)@Bug[67] | 13091.xml(0,1721)@Feature[216]]@[7.1386]
Link = [12695.xml(0,58320)@Bug[15009] | 1910.xml(0,237)@Feature[21]]@[5.9274]
Link = [12698.xml(0,362)@Bug[56] | 13091.xml(0,1721)@Feature[216]]@[4.4054]
Link = [12699.xml(0,616)@Bug[67] | 12700.xml(0,410)@Feature[36]]@[4.0292]
Link = [12703.xml(0,331)@Bug[43] | 13091.xml(0,1721)@Feature[216]]@[3.3257]
Link = [12696.xml(0,272)@Bug[40] | 13091.xml(0,1721)@Feature[216]]@[2.5366]
Link = [12695.xml(0,58320)@Bug[15009] | 12702.xml(0,88)@Feature[9]]@[2.2157]
Link = [12699.xml(0,616)@Bug[67] | 1910.xml(0,237)@Feature[21]]@[2.0420]
Link = [12697.xml(0,257)@Bug[34] | 13091.xml(0,1721)@Feature[216]]@[0.9461]
Link = [12694.xml(0,1504)@Bug[188] | 12702.xml(0,88)@Feature[9]]@[0.9092]
Link = [12694.xml(0,1504)@Bug[188] | 12701.xml(0,137)@Feature[14]]@[0.8928]
Link = [12697.xml(0,257)@Bug[34] | 12702.xml(0,88)@Feature[9]]@[0.8328]
Link = [12696.xml(0,272)@Bug[40] | 12702.xml(0,88)@Feature[9]]@[0.8328]
Link = [12703.xml(0,331)@Bug[43] | 12702.xml(0,88)@Feature[9]]@[0.8328]
Link = [12698.xml(0,362)@Bug[56] | 12702.xml(0,88)@Feature[9]]@[0.8328]
Link = [12703.xml(0,331)@Bug[43] | 12701.xml(0,137)@Feature[14]]@[0.8178]
Link = [12698.xml(0,362)@Bug[56] | 12701.xml(0,137)@Feature[14]]@[0.8178]
Link = [12696.xml(0,272)@Bug[40] | 12701.xml(0,137)@Feature[14]]@[0.8178]
Link = [12697.xml(0,257)@Bug[34] | 12701.xml(0,137)@Feature[14]]@[0.8178]

BM25 links almost everything as "good" (40 of 54 links are above the threshold). The score explanation is as follows:

Score breakdown:
2.2157059 = (MATCH) max of:
  0.0 = (MATCH) MatchAllDocsQuery, product of:
    0.0 = boost
    1.0 = queryNorm
  2.2157059 = (MATCH) sum of:
    1.3065486 = (MATCH) weight(TERM:http in 1) [BM25Similarity], result of:
      1.3065486 = score(doc=1,freq=2.0 = termFreq=2.0
), product of:
        0.6931472 = idf(docFreq=3, maxDocs=6)
        1.8849511 = tfNorm, computed from:
          2.0 = termFreq=2.0
          1.2 = parameter k1
          0.75 = parameter b
          746.8333 = avgFieldLength
          28.444445 = fieldLength
    0.4545787 = (MATCH) weight(TERM:use in 1) [BM25Similarity], result of:
      0.4545787 = score(doc=1,freq=2.0 = termFreq=2.0
), product of:
        0.24116206 = idf(docFreq=5, maxDocs=6)
        1.8849511 = tfNorm, computed from:
          2.0 = termFreq=2.0
          1.2 = parameter k1
          0.75 = parameter b
          746.8333 = avgFieldLength
          28.444445 = fieldLength
    0.4545787 = (MATCH) weight(TERM:use in 1) [BM25Similarity], result of:
      0.4545787 = score(doc=1,freq=2.0 = termFreq=2.0
), product of:
        0.24116206 = idf(docFreq=5, maxDocs=6)
        1.8849511 = tfNorm, computed from:
          2.0 = termFreq=2.0
          1.2 = parameter k1
          0.75 = parameter b
          746.8333 = avgFieldLength
          28.444445 = fieldLength
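The BM25 breakdown above can be recomputed the same way. This is a minimal sketch assuming the usual BM25 formulas, idf = ln(1 + (maxDocs - docFreq + 0.5) / (docFreq + 0.5)) and tfNorm = freq * (k1 + 1) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)). All inputs are copied from the explain output; the class and method names are hypothetical.

```java
// Recomputes the BM25Similarity numbers from the explain output by hand.
// All constants are copied from that output; class and method names are hypothetical.
final class Bm25ScoreCheck {

    static final double K1 = 1.2;                     // parameter k1
    static final double B = 0.75;                     // parameter b
    static final double AVG_FIELD_LENGTH = 746.8333;  // avgFieldLength
    static final double FIELD_LENGTH = 28.444445;     // fieldLength

    // BM25 idf: ln(1 + (maxDocs - docFreq + 0.5) / (docFreq + 0.5))
    static double idf(long docFreq, long maxDocs) {
        return Math.log(1.0 + (maxDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // BM25 term-frequency saturation with document-length normalization
    static double tfNorm(double freq, double k1, double b,
                         double fieldLength, double avgFieldLength) {
        double lengthFactor = 1.0 - b + b * fieldLength / avgFieldLength;
        return freq * (k1 + 1.0) / (freq + k1 * lengthFactor);
    }

    // One term's contribution: idf * tfNorm (no queryNorm, no coord)
    static double termScore(double freq, long docFreq, long maxDocs) {
        return idf(docFreq, maxDocs)
                * tfNorm(freq, K1, B, FIELD_LENGTH, AVG_FIELD_LENGTH);
    }

    // Plain sum of the three term clauses
    static double finalScore() {
        double http = termScore(2.0, 3, 6);
        double use = termScore(2.0, 5, 6);
        return http + use + use; // "use" is listed twice in the breakdown
    }

    public static void main(String[] args) {
        System.out.printf("http term = %.7f%n", termScore(2.0, 3, 6));
        System.out.printf("use term  = %.7f%n", termScore(2.0, 5, 6));
        System.out.printf("final     = %.7f%n", finalScore());
    }
}
```

Unlike the VSM breakdown, this one contains no coord factor and its queryNorm is 1.0, so the final score is just the sum of the per-term products. Because the field is much shorter than the average (28.4 vs. 746.8), tfNorm lands close to its upper bound of k1 + 1 = 2.2 for each term.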

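For debugging purposes I disabled term boosting and other adjustments so that I can see the raw results. Normally all values are clamped to the range [0, 1]: anything above 1 becomes 1 and anything below 0 becomes 0.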

I am using Lucene 5.0.0. The documents are just ordinary tickets, which may reference other tickets.

The similarities are created as follows:

new BM25Similarity(k1, b); where k1 = 1.2 and b = 0.75 (defaults). (BM25)
new DefaultSimilarity() (VSM)

How can the scores be so different? I can see that with VSM everything is much smaller by comparison.

Has anyone else run into this strange behavior?

I would appreciate any help!

- Edit

I am also wondering why queryNorm equals 1.0 for every query with BM25, whereas with VSM it is different for every query.

According to this: Lucene scoring: in what context is queryNorm used?

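queryNorm(q) is a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable.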

Should it always be the same?
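The quoted claim that queryNorm does not change the ranking can be checked with a small sketch. It assumes the DefaultSimilarity-style normalization 1 / sqrt(sumOfSquaredWeights); the raw scores and the sum of squared weights below are made-up illustration values, not taken from the runs above, and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

// Shows that multiplying every score by the same positive factor keeps the ranking.
// The input values are hypothetical; only the normalization formula mirrors TF-IDF scoring.
final class QueryNormRanking {

    // DefaultSimilarity-style query normalization: 1 / sqrt(sumOfSquaredWeights)
    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    // Returns a copy of the scores, each multiplied by the same factor.
    static double[] scale(double[] scores, double factor) {
        double[] scaled = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            scaled[i] = scores[i] * factor;
        }
        return scaled;
    }

    // Returns document indices ordered by descending score.
    static Integer[] rank(double[] scores) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -scores[i]));
        return order;
    }

    public static void main(String[] args) {
        double[] raw = {3.2, 0.7, 1.9};   // hypothetical unnormalized scores
        double norm = queryNorm(94000.0); // hypothetical sum of squared weights
        double[] normalized = scale(raw, norm);

        System.out.println("raw ranking        = " + Arrays.toString(rank(raw)));
        System.out.println("normalized ranking = " + Arrays.toString(rank(normalized)));
    }
}
```

Both rankings come out identical, while the absolute values shrink by the factor. That matters as soon as scores are compared against a fixed cut-off such as the 0.15 threshold in the logs above, because a threshold looks at absolute values rather than at order.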

lucene bm25 scoring

0 answers: