Question

I am confused by my Solr results. I have indexed a query field, and I have 6 documents indexed in Solr.

schema.xml configuration:

<field name="question" type="text_query" indexed="true" stored="true" multiValued="false"/>


<fieldType name="text_query" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>

      <analyzer type="query">
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
         <filter class="solr.EnglishMinimalStemFilterFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>

</fieldType>

I search with query:"gandhi". The documents indexed in Solr are:

  [
            {
             "query": "who is gandhi",
             "source": "quora.com",
             "id" : 1
            },
            {
             "query": "who is person know as gandhi",
             "source": "quora.com",
             "id" : 2
            },
            {
             "query": "who is Sachin",
             "source": "quora.com",
             "id" : 3
            },
            {
             "query": "who is mahatma gandhi",
             "source": "quora.com",
             "id" : 4
            },
            {
             "query": "who is gandhis",
             "source": "quora.com",
             "id" : 5
            },
            {
             "query": "who are gandhi brothers",
             "source": "quora.com",
             "id" : 6
            }
]

Based on my configuration, I expected a different document at the top with maxScore than the one in this response from Solr:

"response": {
    "numFound": 4,
    "start": 0,
    "maxScore": 0.8048013,
    "docs": [
      {
        "query": "who is gandhi btothers",
        "id": "6",
        "source": "quora.com",
        "_version_": 1513810901444067300,
        "score": 0.8048013
      },
      {
        "query": "who is person know as gandhi",
        "id": "2",
        "source": "quora.com",
        "_version_": 1513810901436727300,
        "score": 0.643841
      },
      {
        "query": "who is gandhi",
        "id": "1",
        "source": "quora.com",
        "_version_": 1513810901428338700,
        "score": 0.5945348
      },
      {
        "query": "who is mahatma gandhi",
        "id": "4",
        "source": "quora.com",
        "_version_": 1513810901431484400,
        "score": 0.37158427
      }
    ]
  }

The document I expected at the top, with reference to the explain field in the debug output:

{
    "query": "who is gandhi",
    "id": "1",
    "source": "quora.com",
    "_version_": 1513810901428338700,
    "score": 0.5945348
}

But the results are different. Why is that? Any help is appreciated :)

Answer 1

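Just adding what I think is the actual explanation: if you are doing the sharding yourself (aka "legacy mode"), scores are computed on each shard independently. If you have a small number of documents (or the documents are not randomly distributed across the shards), the score computed on each shard can be wildly different from what you would expect once the final result is retrieved.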

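This is not a configuration issue, just a consequence of scores being calculated on each shard before the responding node merges them.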

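The difference should go away once shard #2 no longer has 50% more documents than shard #1 (the maxDocs= value in the debugQuery output). If you have a few million documents, a difference of two documents is not significant, but when the shards differ by 50% in what they contain, the effect on the score is much larger.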
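To make the maxDocs effect concrete, here is a minimal sketch in Python that recomputes the scores posted in this thread. It assumes Lucene's classic TF-IDF similarity, where idf = 1 + ln(maxDocs / (docFreq + 1)), and a one-term query with tf = 1. The per-shard statistics and the field norms are not stated anywhere in the question; they are values inferred from the posted scores, so treat them as assumptions.

```python
import math


def classic_idf(doc_freq, max_docs):
    """idf as defined by Lucene's classic TF-IDF similarity."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def single_term_score(doc_freq, max_docs, field_norm, tf=1.0):
    """Score of a one-term query.

    With a single term, queryNorm is 1/idf, so the score reduces to
    sqrt(tf) * idf * fieldNorm.
    """
    return math.sqrt(tf) * classic_idf(doc_freq, max_docs) * field_norm


# ASSUMPTION: index statistics inferred from the posted scores.
SMALL_SHARD = {"doc_freq": 2, "max_docs": 2}   # would hold docs 1 and 4
LARGE_SHARD = {"doc_freq": 2, "max_docs": 4}   # would hold docs 2, 3, 5, 6
SINGLE_INDEX = {"doc_freq": 4, "max_docs": 6}  # everything in one shard

# ASSUMPTION: field norms (after Lucene's lossy encoding), also inferred.
FIELD_NORM = {"1": 1.0, "4": 0.625, "6": 0.625, "2": 0.5}
SHARD_OF = {"1": SMALL_SHARD, "4": SMALL_SHARD, "2": LARGE_SHARD, "6": LARGE_SHARD}

# Each shard scores with its own statistics; the single index uses one set.
sharded = {
    doc_id: single_term_score(field_norm=FIELD_NORM[doc_id], **SHARD_OF[doc_id])
    for doc_id in FIELD_NORM
}
single = {
    doc_id: single_term_score(field_norm=FIELD_NORM[doc_id], **SINGLE_INDEX)
    for doc_id in FIELD_NORM
}

# Rank by descending score, as the merged response would.
for label, scores in (("two shards", sharded), ("single shard", single)):
    ranking = sorted(scores, key=scores.get, reverse=True)
    print(label, [(doc_id, round(scores[doc_id], 5)) for doc_id in ranking])
```

Under these assumptions the computed values agree with both posted responses to about five decimal places: with two shards doc 6 comes out on top, while with one shard doc 1 does.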

See "Distributing Documents across Shards":

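In legacy distributed mode, Solr does not calculate universal term/doc frequencies. For most large-scale implementations, it is not likely to matter that Solr calculates TF/IDF at the shard level. However, if your collection is heavily skewed in its distribution across servers, you may find misleading relevancy results in your searches. In general, it is probably best to randomly distribute documents to your shards.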

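If you later switch to running in SolrCloud mode, that part of the calculation should be done across the collection rather than locally on each shard (as it is in legacy mode with manual sharding).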
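As a rough illustration of what "across the collection" means, the sketch below sums docFreq and maxDocs over the shards before computing idf. This is not Solr's implementation, and whether and how Solr aggregates these statistics depends on its version and configuration. The per-shard numbers are assumptions inferred from the scores posted in the question, and the idf formula is the one from Lucene's classic TF-IDF similarity.

```python
import math


def classic_idf(doc_freq, max_docs):
    """idf as defined by Lucene's classic TF-IDF similarity."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def global_idf(shard_stats):
    """idf computed from statistics summed over all shards.

    shard_stats is a list of (doc_freq, max_docs) pairs, one per shard.
    """
    total_doc_freq = sum(doc_freq for doc_freq, _ in shard_stats)
    total_max_docs = sum(max_docs for _, max_docs in shard_stats)
    return classic_idf(total_doc_freq, total_max_docs)


# ASSUMPTION: (docFreq, maxDocs) for the term "gandhi" on each shard,
# inferred from the posted scores rather than stated in the question.
shard_stats = [(2, 2), (2, 4)]

local_idfs = [classic_idf(doc_freq, max_docs) for doc_freq, max_docs in shard_stats]
collection_idf = global_idf(shard_stats)

print("per-shard idf:", [round(value, 5) for value in local_idfs])
print("collection-wide idf:", round(collection_idf, 5))
```

With summed statistics every document is weighted by the same idf, which is the same value a single index holding all six documents would use, so the ranking is decided only by the per-document factors.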

Answer 2

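It seems the problem was in the scoring algorithm. My doc 1 and doc 2 were not scored with the same algorithm, and the reason for the difference is multiple shards. If I use a single shard, I get the right output: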

"response": {
"numFound": 4,
"start": 0,
"maxScore": 1.1823215,
"docs": [
  {
    "query": "who is gandhi",
    "id": "1",
    "source": "quora.com",
    "_version_": 1513817219957522400,
    "score": 1.1823215
  },
  {
    "query": "who is mahatma gandhi",
    "id": "4",
    "source": "quora.com",
    "_version_": 1513817220025680000,
    "score": 0.73895097
  },
  {
    "query": "who are gandhi brothers",
    "id": "6",
    "source": "quora.com",
    "_version_": 1513817220027777000,
    "score": 0.73895097
  },
  {
    "query": "who is person know as gandhi",
    "id": "2",
    "source": "quora.com",
    "_version_": 1513817220023582700,
    "score": 0.5911608
  }
]

} }

Confusion about the score field in Solr results
