Confused by the score in Solr results

Time: 2015-10-01 07:18:05

Tags: solr

I'm confused by my Solr results. I have 6 documents indexed in Solr, with the query field configured in schema.xml as follows:

<field name="question" type="text_query" indexed="true" stored="true" multiValued="false"/>


<fieldType name="text_query" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>

      <analyzer type="query">
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
         <filter class="solr.EnglishMinimalStemFilterFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>

</fieldType>

These are the 6 indexed documents, which I searched with the query "gandhi":

  [
            {
             "query": "who is gandhi",
             "source": "quora.com",
             "id" : 1
            },
            {
             "query": "who is person know as gandhi",
             "source": "quora.com",
             "id" : 2
            },
            {
             "query": "who is Sachin",
             "source": "quora.com",
             "id" : 3
            },
            {
             "query": "who is mahatma gandhi",
             "source": "quora.com",
             "id" : 4
            },
            {
             "query": "who is gandhis",
             "source": "quora.com",
             "id" : 5
            },
            {
             "query": "who are gandhi brothers",
             "source": "quora.com",
             "id" : 6
            }
]

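The response from Solr for that search is below. Based on my configuration, I expected a different document at the top with maxScore: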
"response": {
    "numFound": 4,
    "start": 0,
    "maxScore": 0.8048013,
    "docs": [
      {
        "query": "who is gandhi btothers",
        "id": "6",
        "source": "quora.com",
        "_version_": 1513810901444067300,
        "score": 0.8048013
      },
      {
        "query": "who is person know as gandhi",
        "id": "2",
        "source": "quora.com",
        "_version_": 1513810901436727300,
        "score": 0.643841
      },
      {
        "query": "who is gandhi",
        "id": "1",
        "source": "quora.com",
        "_version_": 1513810901428338700,
        "score": 0.5945348
      },
      {
        "query": "who is mahatma gandhi",
        "id": "4",
        "source": "quora.com",
        "_version_": 1513810901431484400,
        "score": 0.37158427
      }
    ]
  }
This is the document I expected at the top (I also looked at the explain field in the debug output):

{
    "query": "who is gandhi",
    "id": "1",
    "source": "quora.com",
    "_version_": 1513810901428338700,
    "score": 0.5945348
}

But the result is different. Why is that? Any help is appreciated :)

2 answers:

Answer 0: (score: 1)

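Just to add what I think is the actual explanation: if you are doing the sharding yourself (a.k.a. "legacy mode"), the score is calculated on each shard by itself. If you have a small number of documents (or the documents are not randomly distributed across the shards), the scores from each shard can be completely different from one another by the time you retrieve the final result.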
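One way to see the per-shard scores directly is to query each core on its own with debugging turned on. The sketch below only builds the request URLs. The core URLs and the `query:gandhi` field query are assumptions made for illustration; `fl`, `debugQuery`, `distrib` and `wt` are standard Solr request parameters.

```python
from urllib.parse import urlencode


def shard_debug_url(core_url, query):
    """Build a /select URL that asks a single core for its own, unmerged scores."""
    params = {
        "q": query,
        "fl": "*,score",       # stored fields plus the computed score
        "debugQuery": "true",  # adds the explain section to the response
        "distrib": "false",    # keep the request on this core only
        "wt": "json",
    }
    return f"{core_url}/select?{urlencode(params)}"


# Hypothetical core URLs; replace with the real shard addresses.
CORES = (
    "http://localhost:8983/solr/shard1",
    "http://localhost:8984/solr/shard2",
)

urls = [shard_debug_url(core, "query:gandhi") for core in CORES]
for url in urls:
    print(url)
```

Comparing the explain output from each core should show which index statistics each shard used for its score.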

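This is not a configuration problem. It is just a consequence of the scores being calculated on each shard before they are merged by the responding node.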
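A rough sketch of that effect, assuming Lucene's classic TF-IDF similarity (idf = 1 + ln(numDocs / (docFreq + 1))), a term frequency of 1, and a single-term query whose query weight normalizes to 1. The shard layout and the field norms below are hypothetical values, not taken from the question's debug output; they were chosen because they give scores close to the ones reported in the question.

```python
import math


def classic_idf(doc_freq, num_docs):
    """idf as defined by Lucene's classic TF-IDF similarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Assumed layout: shard name -> (documents on the shard, documents matching "gandhi")
SHARDS = {"shard1": (2, 2), "shard2": (4, 2)}

# Assumed placement and length norm: (doc id, shard, field norm)
DOCS = [
    ("1", "shard1", 1.0),
    ("4", "shard1", 0.625),
    ("2", "shard2", 0.5),
    ("6", "shard2", 0.625),
]


def shard_local_scores():
    """Score each document with the statistics of its own shard only."""
    scores = {}
    for doc_id, shard, norm in DOCS:
        num_docs, doc_freq = SHARDS[shard]
        scores[doc_id] = classic_idf(doc_freq, num_docs) * norm  # tf = 1
    return scores


scores = shard_local_scores()
# The responding node merges the per-shard hits by score, highest first.
merged = sorted(scores, key=scores.get, reverse=True)
for doc_id in merged:
    print(doc_id, scores[doc_id])
```

Under these assumptions the shortest document (id 1) falls to third place only because its shard has a lower idf for "gandhi", not because of anything in the schema.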

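The difference should go away as soon as shard #2 no longer has 50% more documents than shard #1 (the maxdoc= value in the debugQuery output). If you have a few million documents, a difference of two documents doesn't matter much, but when there is a 50% difference between the shards in what they contain, it has a much bigger effect on the score.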

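See "Distributing Documents across Shards":

"In legacy distributed mode, Solr does not calculate universal term/doc frequencies. For most large-scale implementations, it is not likely to matter that Solr calculates TF/IDF at the shard level. However, if your collection is heavily skewed in its distribution across servers, you may find misleading relevancy results in your searches. In general, it is best to distribute documents randomly to your shards."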

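If you later switch to running in SolrCloud mode, that part of the calculation should be done across the whole collection rather than locally on each shard (as it is with manual sharding in legacy mode).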

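Answer 1: (score: 0)

It seems the problem is with the scoring calculation. I was not getting the same calculation for doc 1 and doc 2. The reason for the difference is multiple shards. If I use a single shard, I get the right output: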

"response": {
"numFound": 4,
"start": 0,
"maxScore": 1.1823215,
"docs": [
  {
    "query": "who is gandhi",
    "id": "1",
    "source": "quora.com",
    "_version_": 1513817219957522400,
    "score": 1.1823215
  },
  {
    "query": "who is mahatma gandhi",
    "id": "4",
    "source": "quora.com",
    "_version_": 1513817220025680000,
    "score": 0.73895097
  },
  {
    "query": "who are gandhi brothers",
    "id": "6",
    "source": "quora.com",
    "_version_": 1513817220027777000,
    "score": 0.73895097
  },
  {
    "query": "who is person know as gandhi",
    "id": "2",
    "source": "quora.com",
    "_version_": 1513817220023582700,
    "score": 0.5911608
  }
]

}
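As a sanity check on the single-shard numbers, here is a minimal sketch assuming Lucene's classic TF-IDF similarity, a term frequency of 1, and hypothetical field norms (1.0, 0.625 and 0.5 for one, two and three indexed terms). With all 6 documents in one index and 4 of them matching, every document shares the same idf, so only document length separates the scores.

```python
import math


def classic_idf(doc_freq, num_docs):
    """idf as defined by Lucene's classic TF-IDF similarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Single shard: all 6 documents in one index, 4 of them match "gandhi" (numFound).
idf = classic_idf(doc_freq=4, num_docs=6)

# Assumed field norms per doc id; these are not taken from the debug output.
FIELD_NORMS = {"1": 1.0, "4": 0.625, "6": 0.625, "2": 0.5}

single_shard_scores = {doc_id: idf * norm for doc_id, norm in FIELD_NORMS.items()}
for doc_id, score in single_shard_scores.items():
    print(doc_id, score)
```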