I'm confused by my Solr results. I index a query field, and I have 6 documents indexed in Solr.
The schema.xml configuration is:
<field name="question" type="text_query" indexed="true" stored="true" multiValued="false"/>
<fieldType name="text_query" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
These are the 6 indexed documents, which I search with the query "gandhi":
[
{
"query": "who is gandhi",
"source": "quora.com",
"id" : 1
},
{
"query": "who is person know as gandhi",
"source": "quora.com",
"id" : 2
},
{
"query": "who is Sachin",
"source": "quora.com",
"id" : 3
},
{
"query": "who is mahatma gandhi",
"source": "quora.com",
"id" : 4
},
{
"query": "who is gandhis",
"source": "quora.com",
"id" : 5
},
{
"query": "who are gandhi brothers",
"source": "quora.com",
"id" : 6
}
]
The response from Solr is:
"response": {
"numFound": 4,
"start": 0,
"maxScore": 0.8048013,
"docs": [
{
"query": "who is gandhi btothers",
"id": "6",
"source": "quora.com",
"_version_": 1513810901444067300,
"score": 0.8048013
},
{
"query": "who is person know as gandhi",
"id": "2",
"source": "quora.com",
"_version_": 1513810901436727300,
"score": 0.643841
},
{
"query": "who is gandhi",
"id": "1",
"source": "quora.com",
"_version_": 1513810901428338700,
"score": 0.5945348
},
{
"query": "who is mahatma gandhi",
"id": "4",
"source": "quora.com",
"_version_": 1513810901431484400,
"score": 0.37158427
}
]
}
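Going by my configuration and the explain field in the debug output, I think the following document should be at the top, with the max score: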
{
"query": "who is gandhi",
"id": "1",
"source": "quora.com",
"_version_": 1513810901428338700,
"score": 0.5945348
}
But the result is different. Why does this happen? Any help is appreciated :)
Answer 0 (score: 1)
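Just adding what I believe is the actual explanation: if you do the sharding yourself (a.k.a. "legacy mode"), the score is calculated on each shard by itself. If you have a small number of documents (or the documents are not randomly distributed across the shards), the score computed on each shard can be quite different from what you would expect to see in the final merged result.
This is not a configuration issue, just a consequence of the score being calculated on each shard before the results are merged by the responding node.
The difference should go away as soon as shard #2 no longer has 50% more documents than shard #1 (the maxDocs= value in the debugQuery output). If you have several million documents, a difference of two documents doesn't matter much, but when there is a 50% difference between the shards in what they contain, it has a much larger effect on the score.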
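The effect of the per-shard statistics can be sketched numerically. The following is a minimal sketch, assuming Lucene's classic TF-IDF similarity, where idf = 1 + ln(maxDocs / (docFreq + 1)) and a single-term query scores roughly idf * fieldNorm. The shard layout (one shard holding docs 1 and 4, the other holding docs 2, 3, 5 and 6) and the fieldNorm values are assumptions inferred from the scores in the question, not something the question states.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """idf as used by Lucene's classic TF-IDF similarity."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def single_term_score(doc_freq: int, max_docs: int, field_norm: float, tf: float = 1.0) -> float:
    # For a one-term query, queryNorm is 1/idf, so idf^2 * queryNorm collapses to idf.
    return math.sqrt(tf) * classic_idf(doc_freq, max_docs) * field_norm


# Assumed fieldNorm per document id (1/sqrt(term count) after lossy one-byte
# encoding, with stop words such as "who", "is", "are", "as" removed).
FIELD_NORM = {"1": 1.0, "4": 0.625, "6": 0.625, "2": 0.5}

# Assumed shard layout, inferred from the scores in the question.
# doc_freq counts the documents on that shard containing "gandhi".
SHARDS = {
    "shard_a": {"max_docs": 2, "doc_freq": 2, "docs": ["1", "4"]},
    "shard_b": {"max_docs": 4, "doc_freq": 2, "docs": ["6", "2"]},
}


def sharded_scores() -> dict:
    """Score each matching document with its own shard's statistics."""
    scores = {}
    for shard in SHARDS.values():
        for doc_id in shard["docs"]:
            scores[doc_id] = single_term_score(
                shard["doc_freq"], shard["max_docs"], FIELD_NORM[doc_id]
            )
    return scores


def single_shard_scores() -> dict:
    """Score every matching document with collection-wide statistics (6 docs, 4 matches)."""
    return {doc_id: single_term_score(4, 6, norm) for doc_id, norm in FIELD_NORM.items()}


if __name__ == "__main__":
    for label, scores in (("two shards", sharded_scores()), ("one shard", single_shard_scores())):
        ranked = sorted(scores.items(), key=lambda kv: -kv[1])
        print(label, [(doc_id, round(score, 4)) for doc_id, score in ranked])
```

Under these assumptions the sketch lands within rounding of the scores in the question: docs 1 and 4 are penalised because every document on their small shard contains "gandhi", which pushes the idf there below 1, while docs 6 and 2 get a higher idf on the larger shard.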
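See "Distributing Documents across Shards":
In legacy distributed mode, Solr does not calculate universal term/doc frequencies. For most large-scale implementations, it is not likely to matter that Solr calculates TF/IDF at the shard level. However, if your collection is heavily skewed in its distribution across servers, you may find misleading relevancy results in your searches. In general, it is probably best to randomly distribute documents to your shards.
If you later switch to running in SolrCloud mode, that part of the calculation should be done for the collection as a whole, rather than locally on each shard (as happens in legacy mode with manual sharding).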
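To inspect the per-shard maxDocs and docFreq values yourself, one option is to send the legacy distributed request with debugQuery=true and read the explain output. A minimal sketch that only builds the request URL; the hosts, the core name core1, and the field name query are hypothetical placeholders, while shards, fl and debugQuery are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Hypothetical shard addresses and core name; replace with your own.
SHARD_URLS = ["localhost:8983/solr/core1", "localhost:7574/solr/core1"]


def build_debug_query_url(base_url: str, field: str, term: str, shards: list) -> str:
    """Build a legacy distributed search URL that also asks for debug/explain output."""
    params = {
        "q": f"{field}:{term}",
        "shards": ",".join(shards),  # manual sharding: list every shard explicitly
        "fl": "*,score",             # include the score in each returned document
        "debugQuery": "true",        # adds the explain section with idf details
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(build_debug_query_url("http://localhost:8983/solr/core1", "query", "gandhi", SHARD_URLS))
```

Comparing the idf lines of the explain output for documents that live on different shards should show the differing docFreq and maxDocs values directly.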
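Answer 1 (score: 0)
It seems the problem is in the scoring. Doc 1 and doc 2 are not scored with the same calculation, and the reason for the difference is the multiple shards. If I use a single shard, I get the right output: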
"response": {
"numFound": 4,
"start": 0,
"maxScore": 1.1823215,
"docs": [
{
"query": "who is gandhi",
"id": "1",
"source": "quora.com",
"_version_": 1513817219957522400,
"score": 1.1823215
},
{
"query": "who is mahatma gandhi",
"id": "4",
"source": "quora.com",
"_version_": 1513817220025680000,
"score": 0.73895097
},
{
"query": "who are gandhi brothers",
"id": "6",
"source": "quora.com",
"_version_": 1513817220027777000,
"score": 0.73895097
},
{
"query": "who is person know as gandhi",
"id": "2",
"source": "quora.com",
"_version_": 1513817220023582700,
"score": 0.5911608
}
]
}