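I have a question about Lucene scoring. I have two documents in the index: one contains "my name", and the other contains those same words but not as that exact phrase. When I search for the keywords "my name", the second document is listed above the first. What I want is for a document that contains the exact keywords I typed to be listed first, followed by the others. Can anyone help me with how to do this? Thanks.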
Answer 0: (score: 3)
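Second attempt at an answer: Lucene's default behavior should already be what you are asking for. The key factor here is the lengthNorm() part of the score, which tends to score longer documents lower than shorter ones. For context, see Lucene's Similarity API. If, for example, the two hits have the same lengthNorm, they are ordered arbitrarily.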
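As a rough illustration of the lengthNorm() point, the sketch below assumes the classic TF-IDF similarity, where the length norm of a field is 1/sqrt(number of terms in the field). The class and method names are made up for this example and are not Lucene API. Real Lucene also packs the norm into a single byte, so fields of similar length can end up with the same stored value.

```java
// Sketch only: mirrors the 1/sqrt(numTerms) formula of the classic similarity.
// LengthNormSketch is a hypothetical name, not a Lucene class.
class LengthNormSketch {
    // Shorter fields get a larger norm, so a match in them scores higher.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        System.out.println("2 terms: " + lengthNorm(2));
        System.out.println("3 terms: " + lengthNorm(3));
    }
}
```

Under this assumption a two-term field gets a higher norm than a three-term field, which is why the shorter document would normally be expected to rank first.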
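The explain() function will help you see why your documents are scored the way they are, rather than the way the default behavior would suggest.
I assume you are using a BooleanQuery. If you post exactly how you build the query, I can say more. See also Query Parser Syntax. I hope this is closer to the mark.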
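Answer 1: (score: 0)
If you use lucli from the command line (download the latest Lucene source; it is in the contrib directory), you can use the "explain" command to have Lucene explain why it scored a document the way it did.
It produces something like this:
---------------- 2 score:0.6089077 ---------------------
(and so on for your document)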
Explanation:4.260467 = (MATCH) sum of:
  0.59024054 = (MATCH) weight(description:warwick in 276780), product of:
    0.05595057 = queryWeight(description:warwick), product of:
      5.2746606 = idf(docFreq=13531, numDocs=843621)
      0.010607426 = queryNorm
    10.549321 = (MATCH) fieldWeight(description:warwick in 276780), product of:
      1.0 = tf(termFreq(description:warwick)=1)
      5.2746606 = idf(docFreq=13531, numDocs=843621)
      2.0 = fieldNorm(field=description, doc=276780)
  0.832554 = (MATCH) weight(keywords:warwick in 276780), product of:
    0.066450186 = queryWeight(keywords:warwick), product of:
      6.264497 = idf(docFreq=5028, numDocs=843621)
      0.010607426 = queryNorm
    12.528994 = (MATCH) fieldWeight(keywords:warwick in 276780), product of:
      1.0 = tf(termFreq(keywords:warwick)=1)
      6.264497 = idf(docFreq=5028, numDocs=843621)
      2.0 = fieldNorm(field=keywords, doc=276780)
  0.19180772 = (MATCH) weight(url:warwick in 276780), product of:
    0.048220757 = queryWeight(url:warwick), product of:
      4.5459433 = idf(docFreq=28043, numDocs=843621)
      0.010607426 = queryNorm
    3.9777002 = (MATCH) fieldWeight(url:warwick in 276780), product of:
      1.0 = tf(termFreq(url:warwick)=1)
      4.5459433 = idf(docFreq=28043, numDocs=843621)
      0.875 = fieldNorm(field=url, doc=276780)
  0.023709858 = (MATCH) weight(content:warwick in 276780), product of:
    0.03373665 = queryWeight(content:warwick), product of:
      3.1804748 = idf(docFreq=109863, numDocs=843621)
      0.010607426 = queryNorm
    0.7027923 = (MATCH) fieldWeight(content:warwick in 276780), product of:
      1.4142135 = tf(termFreq(content:warwick)=2)
      3.1804748 = idf(docFreq=109863, numDocs=843621)
      0.15625 = fieldNorm(field=content, doc=276780)
  0.46163678 = (MATCH) weight(siteDescription:warwick in 276780), product of:
    0.0494812 = queryWeight(siteDescription:warwick), product of:
      4.6647696 = idf(docFreq=24901, numDocs=843621)
      0.010607426 = queryNorm
    9.329539 = (MATCH) fieldWeight(siteDescription:warwick in 276780), product of:
      1.0 = tf(termFreq(siteDescription:warwick)=1)
      4.6647696 = idf(docFreq=24901, numDocs=843621)
      2.0 = fieldNorm(field=siteDescription, doc=276780)
  0.96127754 = (MATCH) weight(siteUrl:warwick in 276780), product of:
    0.10097861 = queryWeight(siteUrl:warwick), product of:
      9.519615 = idf(docFreq=193, numDocs=843621)
      0.010607426 = queryNorm
    9.519615 = (MATCH) fieldWeight(siteUrl:warwick in 276780), product of:
      1.0 = tf(termFreq(siteUrl:warwick)=1)
      9.519615 = idf(docFreq=193, numDocs=843621)
      1.0 = fieldNorm(field=siteUrl, doc=276780)
  0.62917286 = (MATCH) weight(title:warwick in 276780), product of:
    0.05776636 = queryWeight(title:warwick), product of:
      5.4458413 = idf(docFreq=11402, numDocs=843621)
      0.010607426 = queryNorm
    10.891683 = (MATCH) fieldWeight(title:warwick in 276780), product of:
      1.0 = tf(termFreq(title:warwick)=1)
      5.4458413 = idf(docFreq=11402, numDocs=843621)
      2.0 = fieldNorm(field=title, doc=276780)
  0.57006776 = (MATCH) weight(second_title:warwick in 276780), product of:
    0.05498614 = queryWeight(second_title:warwick), product of:
      5.18374 = idf(docFreq=14819, numDocs=843621)
      0.010607426 = queryNorm
    10.36748 = (MATCH) fieldWeight(second_title:warwick in 276780), product of:
      1.0 = tf(termFreq(second_title:warwick)=1)
      5.18374 = idf(docFreq=14819, numDocs=843621)
      2.0 = fieldNorm(field=second_title, doc=276780)
(Sorry, I only have a very large index to pull an example from, not a simple one!)
Answer 2: (score: 0)
I would change the query as follows.
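To make the output easier to read, here is a small sketch that recomputes the first clause (description:warwick) from the numbers printed above. It only follows the arithmetic the explanation itself states: each "product of" line multiplies its children, and tf is the square root of the term frequency, as the tf(termFreq=2) = 1.4142135 line suggests. The class and method names are hypothetical and not part of Lucene.

```java
// Sketch only: recomputes one clause of the explain output shown above.
// ExplainArithmetic is a hypothetical helper, not part of Lucene.
class ExplainArithmetic {
    // tf in this output is the square root of the raw term frequency.
    static double tf(int termFreq) {
        return Math.sqrt(termFreq);
    }

    // queryWeight = idf * queryNorm
    static double queryWeight(double idf, double queryNorm) {
        return idf * queryNorm;
    }

    // fieldWeight = tf * idf * fieldNorm
    static double fieldWeight(int termFreq, double idf, double fieldNorm) {
        return tf(termFreq) * idf * fieldNorm;
    }

    // weight of one clause = queryWeight * fieldWeight
    static double clauseScore(int termFreq, double idf, double queryNorm, double fieldNorm) {
        return queryWeight(idf, queryNorm) * fieldWeight(termFreq, idf, fieldNorm);
    }

    public static void main(String[] args) {
        double idf = 5.2746606, queryNorm = 0.010607426, fieldNorm = 2.0;
        System.out.println(queryWeight(idf, queryNorm));               // compare with 0.05595057
        System.out.println(fieldWeight(1, idf, fieldNorm));            // compare with 10.549321
        System.out.println(clauseScore(1, idf, queryNorm, fieldNorm)); // compare with 0.59024054
    }
}
```

The top-level 4.260467 is then the sum of the eight clause weights. Note how the fieldNorm differs per field (2.0 for description, 0.15625 for content), which is where field length enters the score.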
(my AND name) OR "my name"
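Here, the additional phrase query adds to the score whenever there is a phrase match. If a document contains both words but not as the exact phrase, the phrase query contributes no additional score. A document whose content contains the exact phrase "my name" gets the additional score and appears at the top.
Here, I am assuming that length normalization is ignored.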
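The effect of that query can be pictured with a toy model. This is not how Lucene computes scores; it simply awards one point when every term is present and one more when the terms appear as a consecutive phrase, ignoring length normalization as assumed above. The class name and the sample document "my first name" are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model only, not Lucene scoring.
// PhraseBoostSketch is a hypothetical name used for this illustration.
class PhraseBoostSketch {
    // Models the (my AND name) part: every term must occur somewhere.
    static boolean containsAll(List<String> doc, List<String> terms) {
        return doc.containsAll(terms);
    }

    // Models the "my name" part: terms must occur consecutively, in order.
    static boolean containsPhrase(List<String> doc, List<String> phrase) {
        return Collections.indexOfSubList(doc, phrase) >= 0;
    }

    static int score(List<String> doc, List<String> terms) {
        int s = 0;
        if (containsAll(doc, terms)) s += 1;    // (my AND name)
        if (containsPhrase(doc, terms)) s += 1; // OR "my name"
        return s;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("my", "name");
        List<String> exact = Arrays.asList("my", "name");          // exact phrase
        List<String> other = Arrays.asList("my", "first", "name"); // hypothetical non-phrase match
        System.out.println("exact: " + score(exact, query));
        System.out.println("other: " + score(other, query));
    }
}
```

In this model both documents satisfy the AND clause, but only the exact match earns the phrase bonus, so it sorts first.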
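Answer 3: (score: 0)
I had a similar problem. I solved it using {{1}}, which supports PhraseQuery (the relative positions of the terms in the document are taken into account). Hope this helps.
See more: How can Lucene's scoring depend on relative position of query?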