Question

Lucene得分似乎完全无法理解。

我有以下一组文件：

Senior Education Recruitment Consultant
Senior IT Recruitment Consultant
Senior Recruitment Consultant

已使用EnglishAnalyzer分析了这些内容。

搜索查询也是使用QueryParser使用EnglishAnalyzer构建的。

当我搜索Senior Recruitment Consultant时，上述每个文档都返回相同的分数，其中所需（和预期）的结果将是Senior Recruitment Consultant作为最佳结果。

是否有一种直接的方式来实现我错过的理想行为？

这是我的调试输出：

4.6491017 = (MATCH) sum of:
  1.1064172 = (MATCH) weight(Title:senior in 22157) [DefaultSimilarity], result of:
    1.1064172 = score(doc=22157,freq=1.0 = termFreq=1.0
), product of:
      0.4878372 = queryWeight, product of:
        4.53601 = idf(docFreq=818, maxDocs=28116)
        0.10754765 = queryNorm
      2.268005 = fieldWeight in 22157, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        4.53601 = idf(docFreq=818, maxDocs=28116)
        0.5 = fieldNorm(doc=22157)
  2.3421772 = (MATCH) weight(Title:recruit in 22157) [DefaultSimilarity], result of:
    2.3421772 = score(doc=22157,freq=1.0 = termFreq=1.0
), product of:
      0.70978254 = queryWeight, product of:
        6.5997033 = idf(docFreq=103, maxDocs=28116)
        0.10754765 = queryNorm
      3.2998517 = fieldWeight in 22157, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        6.5997033 = idf(docFreq=103, maxDocs=28116)
        0.5 = fieldNorm(doc=22157)
  1.2005073 = (MATCH) weight(Title:consult in 22157) [DefaultSimilarity], result of:
    1.2005073 = score(doc=22157,freq=1.0 = termFreq=1.0
), product of:
      0.50815696 = queryWeight, product of:
        4.724947 = idf(docFreq=677, maxDocs=28116)
        0.10754765 = queryNorm
      2.3624735 = fieldWeight in 22157, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        4.724947 = idf(docFreq=677, maxDocs=28116)
        0.5 = fieldNorm(doc=22157)

4.6491017 = (MATCH) sum of:
  1.1064172 = (MATCH) weight(Title:senior in 22292) [DefaultSimilarity], result of:
    1.1064172 = score(doc=22292,freq=1.0 = termFreq=1.0
), product of:
      0.4878372 = queryWeight, product of:
        4.53601 = idf(docFreq=818, maxDocs=28116)
        0.10754765 = queryNorm
      2.268005 = fieldWeight in 22292, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        4.53601 = idf(docFreq=818, maxDocs=28116)
        0.5 = fieldNorm(doc=22292)
  2.3421772 = (MATCH) weight(Title:recruit in 22292) [DefaultSimilarity], result of:
    2.3421772 = score(doc=22292,freq=1.0 = termFreq=1.0
), product of:
      0.70978254 = queryWeight, product of:
        6.5997033 = idf(docFreq=103, maxDocs=28116)
        0.10754765 = queryNorm
      3.2998517 = fieldWeight in 22292, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        6.5997033 = idf(docFreq=103, maxDocs=28116)
        0.5 = fieldNorm(doc=22292)
  1.2005073 = (MATCH) weight(Title:consult in 22292) [DefaultSimilarity], result of:
    1.2005073 = score(doc=22292,freq=1.0 = termFreq=1.0
), product of:
      0.50815696 = queryWeight, product of:
        4.724947 = idf(docFreq=677, maxDocs=28116)
        0.10754765 = queryNorm
      2.3624735 = fieldWeight in 22292, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        4.724947 = idf(docFreq=677, maxDocs=28116)
        0.5 = fieldNorm(doc=22292)

4.6491017 = (MATCH) sum of:
  1.1064172 = (MATCH) weight(Title:senior in 22494) [DefaultSimilarity], result of:
    1.1064172 = score(doc=22494,freq=1.0 = termFreq=1.0
), product of:
      0.4878372 = queryWeight, product of:
        4.53601 = idf(docFreq=818, maxDocs=28116)
        0.10754765 = queryNorm
      2.268005 = fieldWeight in 22494, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        4.53601 = idf(docFreq=818, maxDocs=28116)
        0.5 = fieldNorm(doc=22494)
  2.3421772 = (MATCH) weight(Title:recruit in 22494) [DefaultSimilarity], result of:
    2.3421772 = score(doc=22494,freq=1.0 = termFreq=1.0
), product of:
      0.70978254 = queryWeight, product of:
        6.5997033 = idf(docFreq=103, maxDocs=28116)
        0.10754765 = queryNorm
      3.2998517 = fieldWeight in 22494, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        6.5997033 = idf(docFreq=103, maxDocs=28116)
        0.5 = fieldNorm(doc=22494)
  1.2005073 = (MATCH) weight(Title:consult in 22494) [DefaultSimilarity], result of:
    1.2005073 = score(doc=22494,freq=1.0 = termFreq=1.0
), product of:
      0.50815696 = queryWeight, product of:
        4.724947 = idf(docFreq=677, maxDocs=28116)
        0.10754765 = queryNorm
      2.3624735 = fieldWeight in 22494, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        4.724947 = idf(docFreq=677, maxDocs=28116)
        0.5 = fieldNorm(doc=22494)


Senior Education Recruitment Consultant 4.6491017
Senior IT Recruitment Consultant 4.6491017
Senior Recruitment Consultant 4.6491017

Answer 1

您必须依赖的唯一得分元素是长度范围。

Lengthnorm与索引时的文档一起存储，以及字段的提升。它可以将较短的文档得分更高。

那为什么它不起作用？你有两个问题：

首先：使用极其有损压缩存储规范。它们只占用一个字节，并且具有大约1个精确的十进制数字。所以，基本上，差异不足以影响分数。

关于这种损失的理由，来自DefaultSimilarity documentation：

...鉴于用户通过查询表达其真实信息需求的困难（和不准确性），只有重大差异很重要。

第二：＆＃34; IT＆＃34;是英语中的一句话。您的意思是＆＃34;信息技术＆＃34;，但所有分析仪看到的都是常见的英语代名词。无论你投入多少停止字，他们都不会影响长度。

这是一个测试，显示了我想出的一些结果：

Senior Education Recruitment Consultant ::: 0.732527
Senior IT Recruitment Consultant ::: 0.732527
Senior Recruitment Consultant ::: 0.732527
if and but Senior IT IT IT IT IT Recruitment this and that Consultant ::: 0.732527
Senior Education Recruitment Consultant Of Justice ::: 0.64096117
Senior Recruitment Consultant and some other nonsense we don't want to know about ::: 0.3662635

正如您所见，与＃34;高等教育招聘顾问的正义＆＃34;我们只添加一个搜索词，而lengthnorm开始产生差异。但是，对于＆＃34;如果而且高级IT IT IT IT IT招募这个顾问＆＃34;仍然会看到没有区别，因为所有添加的术语都是常见的英语停止词。

解决方案：您可以使用自定义相似性实现修复规范精度问题，该实现不会很难编码（复制DefaultSimilarity，并实现无损encodeNormValue和decodeNormValue）。您还可以使用自定义或空的停用词列表（通过EnglishAnalyzer ctor）设置分析器。

但是，这可能会让宝宝和洗澡水一起扔掉。如果精确匹配得分更高非常重要，那么您可以通过使用您的查询表达更好的服务，例如：

\"Senior Recruitment Consultant\" Senior Recruitment Consultant

结果：

Senior Recruitment Consultant ::: 1.465054
Senior Recruitment Consultant and some other nonsense we don't want to know about ::: 0.732527
Senior Education Recruitment Consultant ::: 0.27469763
Senior IT Recruitment Consultant ::: 0.27469763
if and but Senior IT IT IT IT IT Recruitment this and that Consultant ::: 0.27469763
Senior Education Recruitment Consultant Of Justice ::: 0.24036042

Answer 2

正常的lucene排名是基于频率的，并且不考虑单词之间的距离。

但是，您可以添加邻近搜索字词，这需要预先设定距离内的单词才能完成操作（但您需要知道查询中有多少字。

在SO上有类似问题的答案 Lucene.Net: Relevancy by distance between words

java Lucene最佳匹配并不完全匹配

2 个答案: