Lucene scoring seems completely incomprehensible.
I have the following set of documents:

Senior Education Recruitment Consultant
Senior IT Recruitment Consultant
Senior Recruitment Consultant

These have been analyzed with EnglishAnalyzer, and the search query is built with a QueryParser that also uses EnglishAnalyzer.
When I search for Senior Recruitment Consultant, each of the above documents comes back with an identical score, whereas the desired (and expected) outcome would be Senior Recruitment Consultant as the top result.
Is there a straightforward way to achieve the desired behaviour that I'm missing?
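Roughly, the setup looks like this (a simplified, illustrative sketch rather than my exact code; Version.LUCENE_46 and the surrounding boilerplate are assumptions, but the Title field matches the debug output below):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class TitleSearch {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new EnglishAnalyzer(Version.LUCENE_46);
        RAMDirectory dir = new RAMDirectory();

        // Index the three titles using EnglishAnalyzer.
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_46, analyzer));
        for (String title : new String[] {
                "Senior Education Recruitment Consultant",
                "Senior IT Recruitment Consultant",
                "Senior Recruitment Consultant"}) {
            Document doc = new Document();
            doc.add(new TextField("Title", title, Field.Store.YES));
            writer.addDocument(doc);
        }
        writer.close();

        // Build the query with the same analyzer and search.
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        Query query = new QueryParser(Version.LUCENE_46, "Title", analyzer)
                .parse("Senior Recruitment Consultant");
        TopDocs hits = searcher.search(query, 10);
        for (ScoreDoc sd : hits.scoreDocs) {
            System.out.println(searcher.doc(sd.doc).get("Title") + " " + sd.score);
        }
    }
}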
Here is my debug output:
4.6491017 = (MATCH) sum of:
1.1064172 = (MATCH) weight(Title:senior in 22157) [DefaultSimilarity], result of:
1.1064172 = score(doc=22157,freq=1.0 = termFreq=1.0
), product of:
0.4878372 = queryWeight, product of:
4.53601 = idf(docFreq=818, maxDocs=28116)
0.10754765 = queryNorm
2.268005 = fieldWeight in 22157, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
4.53601 = idf(docFreq=818, maxDocs=28116)
0.5 = fieldNorm(doc=22157)
2.3421772 = (MATCH) weight(Title:recruit in 22157) [DefaultSimilarity], result of:
2.3421772 = score(doc=22157,freq=1.0 = termFreq=1.0
), product of:
0.70978254 = queryWeight, product of:
6.5997033 = idf(docFreq=103, maxDocs=28116)
0.10754765 = queryNorm
3.2998517 = fieldWeight in 22157, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
6.5997033 = idf(docFreq=103, maxDocs=28116)
0.5 = fieldNorm(doc=22157)
1.2005073 = (MATCH) weight(Title:consult in 22157) [DefaultSimilarity], result of:
1.2005073 = score(doc=22157,freq=1.0 = termFreq=1.0
), product of:
0.50815696 = queryWeight, product of:
4.724947 = idf(docFreq=677, maxDocs=28116)
0.10754765 = queryNorm
2.3624735 = fieldWeight in 22157, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
4.724947 = idf(docFreq=677, maxDocs=28116)
0.5 = fieldNorm(doc=22157)
4.6491017 = (MATCH) sum of:
1.1064172 = (MATCH) weight(Title:senior in 22292) [DefaultSimilarity], result of:
1.1064172 = score(doc=22292,freq=1.0 = termFreq=1.0
), product of:
0.4878372 = queryWeight, product of:
4.53601 = idf(docFreq=818, maxDocs=28116)
0.10754765 = queryNorm
2.268005 = fieldWeight in 22292, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
4.53601 = idf(docFreq=818, maxDocs=28116)
0.5 = fieldNorm(doc=22292)
2.3421772 = (MATCH) weight(Title:recruit in 22292) [DefaultSimilarity], result of:
2.3421772 = score(doc=22292,freq=1.0 = termFreq=1.0
), product of:
0.70978254 = queryWeight, product of:
6.5997033 = idf(docFreq=103, maxDocs=28116)
0.10754765 = queryNorm
3.2998517 = fieldWeight in 22292, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
6.5997033 = idf(docFreq=103, maxDocs=28116)
0.5 = fieldNorm(doc=22292)
1.2005073 = (MATCH) weight(Title:consult in 22292) [DefaultSimilarity], result of:
1.2005073 = score(doc=22292,freq=1.0 = termFreq=1.0
), product of:
0.50815696 = queryWeight, product of:
4.724947 = idf(docFreq=677, maxDocs=28116)
0.10754765 = queryNorm
2.3624735 = fieldWeight in 22292, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
4.724947 = idf(docFreq=677, maxDocs=28116)
0.5 = fieldNorm(doc=22292)
4.6491017 = (MATCH) sum of:
1.1064172 = (MATCH) weight(Title:senior in 22494) [DefaultSimilarity], result of:
1.1064172 = score(doc=22494,freq=1.0 = termFreq=1.0
), product of:
0.4878372 = queryWeight, product of:
4.53601 = idf(docFreq=818, maxDocs=28116)
0.10754765 = queryNorm
2.268005 = fieldWeight in 22494, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
4.53601 = idf(docFreq=818, maxDocs=28116)
0.5 = fieldNorm(doc=22494)
2.3421772 = (MATCH) weight(Title:recruit in 22494) [DefaultSimilarity], result of:
2.3421772 = score(doc=22494,freq=1.0 = termFreq=1.0
), product of:
0.70978254 = queryWeight, product of:
6.5997033 = idf(docFreq=103, maxDocs=28116)
0.10754765 = queryNorm
3.2998517 = fieldWeight in 22494, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
6.5997033 = idf(docFreq=103, maxDocs=28116)
0.5 = fieldNorm(doc=22494)
1.2005073 = (MATCH) weight(Title:consult in 22494) [DefaultSimilarity], result of:
1.2005073 = score(doc=22494,freq=1.0 = termFreq=1.0
), product of:
0.50815696 = queryWeight, product of:
4.724947 = idf(docFreq=677, maxDocs=28116)
0.10754765 = queryNorm
2.3624735 = fieldWeight in 22494, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
4.724947 = idf(docFreq=677, maxDocs=28116)
0.5 = fieldNorm(doc=22494)
Senior Education Recruitment Consultant 4.6491017
Senior IT Recruitment Consultant 4.6491017
Senior Recruitment Consultant 4.6491017
Answer 0 (score: 3)
The only scoring element you have to rely on here is the lengthNorm. The lengthNorm is stored with the document at index time, along with the field's boost, and it scores shorter documents higher.
So why isn't that working here? You have two problems:
First: norms are stored with extremely lossy compression. They take up only a single byte, with roughly one decimal digit of precision. So, basically, the differences here just aren't big enough to affect the scores.
On the rationale for this lossiness, from the DefaultSimilarity documentation:

...given the difficulty (and inaccuracy) of users to express their true information need by a query, only big differences matter.
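To see just how coarse that single byte is, here is a quick check (my assumption: Lucene 4.x, where DefaultSimilarity computes the lengthNorm as boost / sqrt(numTerms) and encodes it with SmallFloat.floatToByte315):

import org.apache.lucene.util.SmallFloat;

public class NormPrecision {
    public static void main(String[] args) {
        // lengthNorm of a 3-term title ("Senior Recruitment Consultant"): ~0.577
        float threeTerms = (float) (1.0 / Math.sqrt(3));
        // lengthNorm of a 4-term title ("Senior Education Recruitment Consultant"): 0.5
        float fourTerms = (float) (1.0 / Math.sqrt(4));

        // Both values survive the one-byte round trip as exactly 0.5, which is
        // why the explain output above shows fieldNorm = 0.5 for every document.
        System.out.println(SmallFloat.byte315ToFloat(SmallFloat.floatToByte315(threeTerms))); // 0.5
        System.out.println(SmallFloat.byte315ToFloat(SmallFloat.floatToByte315(fourTerms)));  // 0.5
    }
}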
第二:" IT"是英语中的一句话。您的意思是"信息技术",但所有分析仪看到的都是常见的英语代名词。无论你投入多少停止字,他们都不会影响长度。
Here's a test showing some results I came up with:
Senior Education Recruitment Consultant ::: 0.732527
Senior IT Recruitment Consultant ::: 0.732527
Senior Recruitment Consultant ::: 0.732527
if and but Senior IT IT IT IT IT Recruitment this and that Consultant ::: 0.732527
Senior Education Recruitment Consultant Of Justice ::: 0.64096117
Senior Recruitment Consultant and some other nonsense we don't want to know about ::: 0.3662635
As you can see, with "Senior Education Recruitment Consultant Of Justice", where we've added just one non-query term, the lengthNorm starts to make a difference. "if and but Senior IT IT IT IT IT Recruitment this and that Consultant", however, still shows no difference, because every added term is a common English stopword.
Solutions: you could fix the norm precision issue with a custom Similarity implementation, which wouldn't be hard to write (copy DefaultSimilarity, and implement a lossless encodeNormValue and decodeNormValue). You could also set up the analyzer with a custom or empty stopword list (via the EnglishAnalyzer constructor).
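If you go the custom Similarity route, the two methods you'd change in your copy of DefaultSimilarity could look something like this (a sketch assuming Lucene 4.x, where norms are passed around as a long; you'd set the similarity on both the IndexWriterConfig and the IndexSearcher and then re-index):

// In a class copied from DefaultSimilarity's source:

// Store the raw 32-bit float instead of squeezing it into a single byte.
@Override
public long encodeNormValue(float f) {
    return Float.floatToIntBits(f);
}

@Override
public float decodeNormValue(long norm) {
    return Float.intBitsToFloat((int) norm);
}

And the stopword side is just a constructor argument:

// An EnglishAnalyzer with no stopwords, so "IT" is indexed and counts
// toward the lengthNorm (CharArraySet is in org.apache.lucene.analysis.util):
Analyzer analyzer = new EnglishAnalyzer(Version.LUCENE_46, CharArraySet.EMPTY_SET);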
However, that might be throwing the baby out with the bathwater. If it's important that exact matches score higher, you'd likely be better served by expressing that in your query, for instance:
\"Senior Recruitment Consultant\" Senior Recruitment Consultant
Which results in:
Senior Recruitment Consultant ::: 1.465054
Senior Recruitment Consultant and some other nonsense we don't want to know about ::: 0.732527
Senior Education Recruitment Consultant ::: 0.27469763
Senior IT Recruitment Consultant ::: 0.27469763
if and but Senior IT IT IT IT IT Recruitment this and that Consultant ::: 0.27469763
Senior Education Recruitment Consultant Of Justice ::: 0.24036042
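Built through a QueryParser, that's simply (a sketch; the field name, analyzer, and Version constant are assumed to match the setup above):

QueryParser parser = new QueryParser(Version.LUCENE_46, "Title", analyzer);
// The quoted phrase only matches exact consecutive occurrences, boosting
// them, while the loose terms keep the partial matches in the result set.
Query query = parser.parse("\"Senior Recruitment Consultant\" Senior Recruitment Consultant");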
Answer 1 (score: 0)
Normal Lucene ranking is frequency-based and does not take the distance between words into account.
You can, however, add a proximity search term, which requires the words to appear within a preset distance of each other for it to match (though you need to know how many words are in the query); see the sketch below.
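In Lucene query syntax a proximity term is a quoted phrase with a slop (a sketch, reusing a QueryParser as above; the slop value is the total number of position moves tolerated):

// Slop 0 would demand the exact phrase; slop 2 also tolerates one word
// inserted mid-phrase (e.g. "Senior Education Recruitment Consultant").
Query query = parser.parse("\"Senior Recruitment Consultant\"~2");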
There is an answer to a similar question on SO: Lucene.Net: Relevancy by distance between words