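I am trying to use Lucene to compute the similarity between a number of documents. For the similarity computation I use BM25 and VSM.
Besides Lucene I'm using GATE, an open-source framework for language processing tasks.
While computing the similarity between 15 documents, I ran into some strange behavior.
With VSM, my results look like this: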
Post-processing links before ranking
Ranking all links by similarities
3/54 links above similarity 0.15 threshold
54/54 top-most 1.0 similar links
Post-processing links after ranking
Traced 3 link(s) in 9x6 space:
Link = [12695.xml(0,58320)@Bug[15009] | 12713.xml(0,18247)@Feature[1974]]@[1.6188]
Link = [5822.xml(0,10098)@Bug[1434] | 12713.xml(0,18247)@Feature[1974]]@[1.5119]
Link = [12694.xml(0,1504)@Bug[188] | 12713.xml(0,18247)@Feature[1974]]@[0.2702]
Clearing previous runtime results...
Score breakdown:
6.860396E-7 = (MATCH) max of:
0.0 = (MATCH) MatchAllDocsQuery, product of:
0.0 = boost
0.0032560423 = queryNorm
6.860396E-7 = (MATCH) product of:
0.0034322562 = (MATCH) sum of:
0.0017054792 = (MATCH) weight(TERM:http in 1) [DefaultSimilarity], result of:
0.0017054792 = score(doc=1,freq=2.0), product of:
0.0045762537 = queryWeight, product of:
1.4054651 = idf(docFreq=3, maxDocs=6)
0.0032560423 = queryNorm
0.37268022 = fieldWeight in 1, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
1.4054651 = idf(docFreq=3, maxDocs=6)
0.1875 = fieldNorm(doc=1)
8.6338853E-4 = (MATCH) weight(TERM:use in 1) [DefaultSimilarity], result of:
8.6338853E-4 = score(doc=1,freq=2.0), product of:
0.0032560423 = queryWeight, product of:
1.0 = idf(docFreq=5, maxDocs=6)
0.0032560423 = queryNorm
0.26516503 = fieldWeight in 1, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
1.0 = idf(docFreq=5, maxDocs=6)
0.1875 = fieldNorm(doc=1)
8.6338853E-4 = (MATCH) weight(TERM:use in 1) [DefaultSimilarity], result of:
8.6338853E-4 = score(doc=1,freq=2.0), product of:
0.0032560423 = queryWeight, product of:
1.0 = idf(docFreq=5, maxDocs=6)
0.0032560423 = queryNorm
0.26516503 = fieldWeight in 1, product of:
1.4142135 = tf(freq=2.0), with freq of:
2.0 = termFreq=2.0
1.0 = idf(docFreq=5, maxDocs=6)
0.1875 = fieldNorm(doc=1)
1.9988007E-4 = coord(3/15009)
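As a sanity check, the VSM breakdown above can be reproduced by hand. This is only a minimal sketch in plain Java with no Lucene dependency. The class and method names are made up for illustration, and the formulas are the classic TF-IDF ones that match the numbers in the explain output (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))). queryNorm and fieldNorm are simply copied from the output rather than derived.

```java
// Hypothetical helper, not part of Lucene: re-derives the numbers in the
// DefaultSimilarity explain output using plain double arithmetic.
class VsmExplainCheck {

    // tf(freq) = sqrt(freq)
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    // idf = 1 + ln(maxDocs / (docFreq + 1))
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // Per-term score = queryWeight * fieldWeight, where
    // queryWeight = idf * queryNorm and fieldWeight = tf * idf * fieldNorm.
    static double termScore(double freq, int docFreq, int maxDocs,
                            double queryNorm, double fieldNorm) {
        double idfValue = idf(docFreq, maxDocs);
        double queryWeight = idfValue * queryNorm;
        double fieldWeight = tf(freq) * idfValue * fieldNorm;
        return queryWeight * fieldWeight;
    }

    // coord = matching clauses / total clauses in the query
    static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    public static void main(String[] args) {
        double queryNorm = 0.0032560423; // copied from the explain output
        double fieldNorm = 0.1875;       // copied from the explain output

        double http = termScore(2.0, 3, 6, queryNorm, fieldNorm);
        double use = termScore(2.0, 5, 6, queryNorm, fieldNorm);

        // "use" is listed twice in the breakdown, so it is summed twice here.
        double sum = http + use + use;
        double total = sum * coord(3, 15009);

        System.out.println("http  = " + http);
        System.out.println("use   = " + use);
        System.out.println("sum   = " + sum);
        System.out.println("total = " + total);
    }
}
```

The printed values should agree with the explain output up to float rounding. Most of the shrinkage comes from two factors visible in the breakdown: queryNorm (about 0.003) and coord(3/15009) (about 0.0002).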
With BM25 I get some strange behavior:
Post-processing links before ranking
Ranking all links by similarities
40/54 links above similarity 0.15 threshold
54/54 top-most 1.0 similar links
Post-processing links after ranking
Traced 40 link(s) in 9x6 space:
Link = [12695.xml(0,58320)@Bug[15009] | 12713.xml(0,18247)@Feature[1974]]@[10768.2471]
Link = [5822.xml(0,10098)@Bug[1434] | 12713.xml(0,18247)@Feature[1974]]@[1798.1300]
Link = [12695.xml(0,58320)@Bug[15009] | 13091.xml(0,1721)@Feature[216]]@[965.0315]
Link = [5822.xml(0,10098)@Bug[1434] | 13091.xml(0,1721)@Feature[216]]@[372.0819]
Link = [12694.xml(0,1504)@Bug[188] | 12713.xml(0,18247)@Feature[1974]]@[174.2649]
Link = [12695.xml(0,58320)@Bug[15009] | 12700.xml(0,410)@Feature[36]]@[97.6378]
Link = [5822.xml(0,10098)@Bug[1434] | 1910.xml(0,237)@Feature[21]]@[46.4066]
Link = [12694.xml(0,1504)@Bug[188] | 13091.xml(0,1721)@Feature[216]]@[35.8532]
Link = [5822.xml(0,10098)@Bug[1434] | 12701.xml(0,137)@Feature[14]]@[29.6364]
Link = [12698.xml(0,362)@Bug[56] | 12713.xml(0,18247)@Feature[1974]]@[22.4652]
Link = [132.xml(0,409)@Bug[33] | 12713.xml(0,18247)@Feature[1974]]@[21.1697]
Link = [5822.xml(0,10098)@Bug[1434] | 12700.xml(0,410)@Feature[36]]@[16.7317]
Link = [132.xml(0,409)@Bug[33] | 13091.xml(0,1721)@Feature[216]]@[15.8749]
Link = [12697.xml(0,257)@Bug[34] | 12713.xml(0,18247)@Feature[1974]]@[15.5943]
Link = [12696.xml(0,272)@Bug[40] | 12713.xml(0,18247)@Feature[1974]]@[14.8670]
Link = [5822.xml(0,10098)@Bug[1434] | 12702.xml(0,88)@Feature[9]]@[14.8045]
Link = [12694.xml(0,1504)@Bug[188] | 1910.xml(0,237)@Feature[21]]@[13.8415]
Link = [12694.xml(0,1504)@Bug[188] | 12700.xml(0,410)@Feature[36]]@[11.7942]
Link = [12703.xml(0,331)@Bug[43] | 12713.xml(0,18247)@Feature[1974]]@[11.2949]
Link = [12699.xml(0,616)@Bug[67] | 12713.xml(0,18247)@Feature[1974]]@[9.4193]
Link = [12695.xml(0,58320)@Bug[15009] | 12701.xml(0,137)@Feature[14]]@[8.6146]
Link = [12699.xml(0,616)@Bug[67] | 13091.xml(0,1721)@Feature[216]]@[7.1386]
Link = [12695.xml(0,58320)@Bug[15009] | 1910.xml(0,237)@Feature[21]]@[5.9274]
Link = [12698.xml(0,362)@Bug[56] | 13091.xml(0,1721)@Feature[216]]@[4.4054]
Link = [12699.xml(0,616)@Bug[67] | 12700.xml(0,410)@Feature[36]]@[4.0292]
Link = [12703.xml(0,331)@Bug[43] | 13091.xml(0,1721)@Feature[216]]@[3.3257]
Link = [12696.xml(0,272)@Bug[40] | 13091.xml(0,1721)@Feature[216]]@[2.5366]
Link = [12695.xml(0,58320)@Bug[15009] | 12702.xml(0,88)@Feature[9]]@[2.2157]
Link = [12699.xml(0,616)@Bug[67] | 1910.xml(0,237)@Feature[21]]@[2.0420]
Link = [12697.xml(0,257)@Bug[34] | 13091.xml(0,1721)@Feature[216]]@[0.9461]
Link = [12694.xml(0,1504)@Bug[188] | 12702.xml(0,88)@Feature[9]]@[0.9092]
Link = [12694.xml(0,1504)@Bug[188] | 12701.xml(0,137)@Feature[14]]@[0.8928]
Link = [12697.xml(0,257)@Bug[34] | 12702.xml(0,88)@Feature[9]]@[0.8328]
Link = [12696.xml(0,272)@Bug[40] | 12702.xml(0,88)@Feature[9]]@[0.8328]
Link = [12703.xml(0,331)@Bug[43] | 12702.xml(0,88)@Feature[9]]@[0.8328]
Link = [12698.xml(0,362)@Bug[56] | 12702.xml(0,88)@Feature[9]]@[0.8328]
Link = [12703.xml(0,331)@Bug[43] | 12701.xml(0,137)@Feature[14]]@[0.8178]
Link = [12698.xml(0,362)@Bug[56] | 12701.xml(0,137)@Feature[14]]@[0.8178]
Link = [12696.xml(0,272)@Bug[40] | 12701.xml(0,137)@Feature[14]]@[0.8178]
Link = [12697.xml(0,257)@Bug[34] | 12701.xml(0,137)@Feature[14]]@[0.8178]
BM25 links almost everything, rating it as "good" or productive. The explain output looks like this:
Score breakdown:
2.2157059 = (MATCH) max of:
0.0 = (MATCH) MatchAllDocsQuery, product of:
0.0 = boost
1.0 = queryNorm
2.2157059 = (MATCH) sum of:
1.3065486 = (MATCH) weight(TERM:http in 1) [BM25Similarity], result of:
1.3065486 = score(doc=1,freq=2.0 = termFreq=2.0
), product of:
0.6931472 = idf(docFreq=3, maxDocs=6)
1.8849511 = tfNorm, computed from:
2.0 = termFreq=2.0
1.2 = parameter k1
0.75 = parameter b
746.8333 = avgFieldLength
28.444445 = fieldLength
0.4545787 = (MATCH) weight(TERM:use in 1) [BM25Similarity], result of:
0.4545787 = score(doc=1,freq=2.0 = termFreq=2.0
), product of:
0.24116206 = idf(docFreq=5, maxDocs=6)
1.8849511 = tfNorm, computed from:
2.0 = termFreq=2.0
1.2 = parameter k1
0.75 = parameter b
746.8333 = avgFieldLength
28.444445 = fieldLength
0.4545787 = (MATCH) weight(TERM:use in 1) [BM25Similarity], result of:
0.4545787 = score(doc=1,freq=2.0 = termFreq=2.0
), product of:
0.24116206 = idf(docFreq=5, maxDocs=6)
1.8849511 = tfNorm, computed from:
2.0 = termFreq=2.0
1.2 = parameter k1
0.75 = parameter b
746.8333 = avgFieldLength
28.444445 = fieldLength
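The BM25 breakdown can be checked the same way. Again this is a sketch in plain Java, not Lucene code. The class and method names are hypothetical, and the formulas are the standard BM25 idf and term-frequency normalization, which happen to match the explain values above. fieldLength and avgFieldLength are copied from the output as reported.

```java
// Hypothetical helper, not part of Lucene: re-derives the numbers in the
// BM25Similarity explain output using plain double arithmetic.
class Bm25ExplainCheck {

    // idf = ln(1 + (maxDocs - docFreq + 0.5) / (docFreq + 0.5))
    static double idf(int docFreq, int maxDocs) {
        return Math.log(1.0 + (maxDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // tfNorm = freq * (k1 + 1) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))
    static double tfNorm(double freq, double k1, double b,
                         double fieldLength, double avgFieldLength) {
        double lengthFactor = 1.0 - b + b * fieldLength / avgFieldLength;
        return freq * (k1 + 1.0) / (freq + k1 * lengthFactor);
    }

    // Per-term score = idf * tfNorm (no boost applied)
    static double termScore(double freq, int docFreq, int maxDocs, double k1, double b,
                            double fieldLength, double avgFieldLength) {
        return idf(docFreq, maxDocs) * tfNorm(freq, k1, b, fieldLength, avgFieldLength);
    }

    public static void main(String[] args) {
        double k1 = 1.2;
        double b = 0.75;
        double fieldLength = 28.444445;   // copied from the explain output
        double avgFieldLength = 746.8333; // copied from the explain output

        double http = termScore(2.0, 3, 6, k1, b, fieldLength, avgFieldLength);
        double use = termScore(2.0, 5, 6, k1, b, fieldLength, avgFieldLength);

        // "use" is listed twice in the breakdown, so it is summed twice here.
        double total = http + use + use;

        System.out.println("http  = " + http);
        System.out.println("use   = " + use);
        System.out.println("total = " + total);
    }
}
```

Unlike the VSM breakdown, nothing here multiplies the sum by a coord or queryNorm factor, so the per-term scores add up directly and the result is not bounded by 1.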
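For debugging purposes I disabled term boosting and other adjustments so that I could see the raw results. Normally all values are clamped to the range [0, 1]: anything above 1 becomes 1 and anything below 0 becomes 0.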
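To illustrate the clamping just described, here is a small sketch in plain Java. The helper names are hypothetical and this is not the actual pipeline code. It only shows what forcing scores into [0, 1] and counting links above the 0.15 threshold would look like, using a few of the scores from the logs above.

```java
// Hypothetical illustration of the clamping step; not the real pipeline code.
class ClampSketch {

    // Force a raw score into the range [0, 1].
    static double clamp01(double score) {
        return Math.max(0.0, Math.min(1.0, score));
    }

    // Count how many clamped scores lie strictly above the threshold.
    static int countAbove(double[] scores, double threshold) {
        int count = 0;
        for (double s : scores) {
            if (clamp01(s) > threshold) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // A few raw scores taken from the logs above.
        double[] vsm = {1.6188, 1.5119, 0.2702};
        double[] bm25 = {10768.2471, 1798.1300, 0.8328, 0.8178};

        for (double s : bm25) {
            System.out.println(s + " -> " + clamp01(s));
        }
        System.out.println("VSM above 0.15:  " + countAbove(vsm, 0.15));
        System.out.println("BM25 above 0.15: " + countAbove(bm25, 0.15));
    }
}
```

With clamping in place, a BM25 score of 10768.2471 and one of 1.5 would both be reported as 1.0, so the raw values are only visible with clamping disabled.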
I am using Lucene 5.0.0. The documents are just ordinary tickets, which can reference other tickets.
The similarities are configured as follows:
new BM25Similarity(k1, b); where k1 = 1.2 and b = 0.75 (defaults). (BM25)
new DefaultSimilarity() (VSM)
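How can the scores be so different? I can see that with VSM everything is much smaller by comparison.
Has anyone run into this strange behavior before?
I'd appreciate any help!
- EDIT
I am also wondering why queryNorm equals 1.0 for every query with BM25, while with VSM it differs from query to query.
According to this: Lucene scoring: in what context is queryNorm used?
"queryNorm(q) is a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable."
Should it always be the same?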
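To make the quoted description concrete, here is a minimal sketch of the TF-IDF query normalization as documented for Lucene's TFIDFSimilarity, queryNorm = 1 / sqrt(sumOfSquaredWeights). The class name, the term weights, and the raw document scores are all made-up values for illustration. It only shows that the factor depends on the query's terms and that multiplying every score by it leaves the ranking unchanged.

```java
// Hypothetical illustration of queryNorm in TF-IDF scoring; weights are made up.
class QueryNormSketch {

    // queryNorm = 1 / sqrt(sum of squared term weights of the query)
    static double queryNorm(double[] termWeights) {
        double sumOfSquares = 0.0;
        for (double w : termWeights) {
            sumOfSquares += w * w;
        }
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // Hypothetical per-term query weights (idf values with boost 1).
        double[] queryA = {1.4054651, 1.0, 1.0};
        double[] queryB = {1.4054651, 1.0};

        double normA = queryNorm(queryA);
        double normB = queryNorm(queryB);
        System.out.println("queryNorm A = " + normA);
        System.out.println("queryNorm B = " + normB);

        // Hypothetical unnormalized scores of two documents for query A.
        double doc1 = 3.0;
        double doc2 = 2.0;

        // Scaling both by the same positive factor keeps their order.
        boolean sameOrder = (doc1 > doc2) == (doc1 * normA > doc2 * normA);
        System.out.println("ranking preserved: " + sameOrder);
    }
}
```

Under this formula the value changes whenever the set of query terms changes, which would be consistent with the VSM output varying per query. The BM25 explain output above simply reports queryNorm as 1.0.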