Question

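I'm having some trouble with Elasticsearch... I managed to create a reproducible example on my machine; the code is at the end of the post.

I created just 6 users, "Roger Sand", "Roger Gilbert", "Cindy Sand", "Cindy Gilbert", "Jean-Roger Sands", and "Sand Roger", and indexed them by full name.

I then run a query to match "Roger Sand" and display the associated scores.

Here are two runs of the same script with two different sets of IDs: 84046 to 84051 and 84047 to 84052 (just shifted by 1).

The results are not in the same order, and the scores differ too:

Run with 84046..84051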

Sand Roger => 0.8838835
Roger Sand => 0.2712221
Cindy Sand => 0.22097087
Jean-Roger Sands => 0.17677669
Roger Gilbert => 0.028130025

Run with 84047..84052

Roger Sand => 0.2712221
Sand Roger => 0.2712221
Cindy Sand => 0.22097087
Jean-Roger Sands => 0.17677669
Roger Gilbert => 0.15891947

My question is: why does the "id" have any influence on a search against "full_name"?

Here is the complete Ruby code of the reproducible script.

require 'elasticsearch'

first_id = 84046 # Or 84047
client = Elasticsearch::Client.new(:log => true)
client.transport.reload_connections!
client.indices.delete({:index => 'test'})
client.indices.create({ :index => 'test' })
client.perform_request('POST', 'test/_refresh')

["Roger Sand", "Roger Gilbert", "Cindy Sand", "Cindy Gilbert", "Jean-Roger Sands", "Sand  Roger" ].each_with_index do |name, i|
  i2 = first_id + i
  client.create({
    :index => 'test', :type => 'user',
    :id => i2,
    :body => { :full_name => name }
  })
end

query_options = {
  :type => 'user', :index => 'test',
  :body => {
    :query => { :match => { :full_name => "Roger Sand" } } 
  }
}

client.perform_request('POST', 'test/_refresh')

client.search(query_options)["hits"]["hits"].each do |hit|
  $stderr.puts "#{hit["_source"]["full_name"]} => #{hit["_score"]}"
end

Here is the same thing from the command line:

curl -XDELETE 'http://localhost:9200/test' 
curl -XPUT 'http://localhost:9200/test' 
curl -XPOST 'http://localhost:9200/test/_refresh' 
curl -XPUT 'http://localhost:9200/test/user/84047?op_type=create' -d '{"full_name":"Roger Sand"}'
curl -XPUT 'http://localhost:9200/test/user/84048?op_type=create' -d '{"full_name":"Roger Gilbert"}'
curl -XPUT 'http://localhost:9200/test/user/84049?op_type=create' -d '{"full_name":"Cindy Sand"}'
curl -XPUT 'http://localhost:9200/test/user/84050?op_type=create' -d '{"full_name":"Cindy Gilbert"}'
curl -XPUT 'http://localhost:9200/test/user/84051?op_type=create' -d '{"full_name":"Jean-Roger Sands"}'
curl -XPUT 'http://localhost:9200/test/user/84052?op_type=create' -d '{"full_name":"Sand Roger"}'
curl -XPOST 'http://localhost:9200/test/_refresh' 
curl -XPOST 'http://localhost:9200/test/user/_search?pretty' -d '{"query":{"match":{"full_name":"Roger Sand"}}}'


curl -XDELETE 'http://localhost:9200/test'
curl -XPUT 'http://localhost:9200/test'
curl -XPOST 'http://localhost:9200/test/_refresh'
curl -XPUT 'http://localhost:9200/test/user/84046?op_type=create' -d '{"full_name":"Roger Sand"}'
curl -XPUT 'http://localhost:9200/test/user/84047?op_type=create' -d '{"full_name":"Roger Gilbert"}'
curl -XPUT 'http://localhost:9200/test/user/84048?op_type=create' -d '{"full_name":"Cindy Sand"}'
curl -XPUT 'http://localhost:9200/test/user/84049?op_type=create' -d '{"full_name":"Cindy Gilbert"}'
curl -XPUT 'http://localhost:9200/test/user/84050?op_type=create' -d '{"full_name":"Jean-Roger Sands"}'
curl -XPUT 'http://localhost:9200/test/user/84051?op_type=create' -d '{"full_name":"Sand Roger"}'
curl -XPOST 'http://localhost:9200/test/_refresh'
curl -XPOST 'http://localhost:9200/test/user/_search?pretty' -d '{"query":{"match":{"full_name":"Roger Sand"}}}'

Answer 1

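The problem lies in distributed score calculation.

You create the new index with the default settings, i.e. 5 shards. Each shard is its own Lucene index. When you index a document, Elasticsearch has to decide which shard it goes to, and it does so by hashing the _id (when no routing parameter is given).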

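So by shifting the IDs, you end up distributing the documents across the shards differently. As mentioned above, each shard is its own Lucene index, and a search across multiple shards has to combine the scores computed on each individual shard. Because the routing differs, the term statistics on each shard differ, and so do the individual scores.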
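The routing effect described above can be illustrated with a toy model. This is only a sketch: the hash function Elasticsearch actually uses depends on the version, and `toy_hash`, `shard_for`, and `layout` are hypothetical helpers written for this illustration (a DJB2-style hash stands in for the real one). The point is only that shifting every `_id` by 1 changes which documents share a shard.

```ruby
# Illustrative stand-in for Elasticsearch routing: shard = hash(_id) % number_of_shards.
# ASSUMPTION: toy_hash is a DJB2-style hash chosen for brevity, not the hash
# Elasticsearch really uses.
NUM_SHARDS = 5 # default shard count mentioned in the answer

NAMES = ["Roger Sand", "Roger Gilbert", "Cindy Sand",
         "Cindy Gilbert", "Jean-Roger Sands", "Sand Roger"].freeze

def toy_hash(id)
  # Fold the bytes of the id string into a 32-bit value.
  id.to_s.each_byte.reduce(5381) { |h, b| ((h << 5) + h + b) & 0xffffffff }
end

def shard_for(id, num_shards = NUM_SHARDS)
  toy_hash(id) % num_shards
end

# Map each name to the shard it would land on when IDs start at first_id.
def layout(first_id)
  NAMES.each_with_index.map { |name, i| [name, shard_for(first_id + i)] }.to_h
end

[84046, 84047].each do |first_id|
  puts "first_id=#{first_id}"
  layout(first_id).each { |name, shard| puts "  #{name} -> shard #{shard}" }
end
```

Comparing the two printed layouts shows the same names grouped into shards differently, which is what changes the per-shard document counts that feed into scoring.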

You can verify this by adding explain to your query. For Sand Roger, the idf is computed as idf(docFreq=1, maxDocs=1) = 0.30685282 in one run and idf(docFreq=1, maxDocs=2) = 1 in the other, which produces different scores.
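The two idf values quoted from the explain output can be reproduced by hand, assuming the index uses Lucene's classic TF-IDF similarity, whose idf is 1 + ln(maxDocs / (docFreq + 1)). The helper name `classic_idf` is made up for this sketch.

```ruby
# Classic Lucene TF-IDF inverse document frequency:
#   idf = 1 + ln(maxDocs / (docFreq + 1))
# ASSUMPTION: the index uses this classic similarity, which matches the
# idf(docFreq=..., maxDocs=...) values quoted from the explain output.
def classic_idf(doc_freq, max_docs)
  1.0 + Math.log(max_docs.to_f / (doc_freq + 1))
end

# The term is alone on its shard: one matching document out of one.
puts classic_idf(1, 1)

# The term shares its shard with one other document.
puts classic_idf(1, 2)
```

The only input that changes between the two calls is maxDocs, the number of documents that happen to live on the same shard, which is exactly what the `_id` routing controls.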

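You can set the number of shards to 1, or change the search type to one of the dfs types. Searching http://localhost:9200/test/user/_search?pretty&search_type=dfs_query_and_fetch will give you the correct scores, because it has an

initial scatter phase which goes and computes the distributed term frequencies for more accurate scoring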

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-search-type.html#dfs-query-and-fetch
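Applied to the Ruby script from the question, the two workarounds might look roughly like the sketch below. It only builds the request arguments and wraps the client calls in helper methods; `single_shard_settings`, `dfs_search_params`, `create_single_shard_index`, and `dfs_search` are hypothetical names, and `client` is assumed to be the `Elasticsearch::Client` from the question. The search type value is the one named in the answer; newer Elasticsearch releases may only accept `dfs_query_then_fetch`, so treat it as an assumption to check against your version.

```ruby
require 'json'

# Workaround 1: index settings that create the index with a single shard,
# so all term statistics live in one Lucene index.
def single_shard_settings
  { settings: { index: { number_of_shards: 1 } } }
end

# Workaround 2: keep the default shards but ask for a dfs search type, which
# gathers term frequencies from every shard before scoring.
# ASSUMPTION: 'dfs_query_and_fetch' is taken from the answer; some versions
# only offer 'dfs_query_then_fetch'.
def dfs_search_params(search_type = 'dfs_query_and_fetch')
  {
    index: 'test', type: 'user',
    search_type: search_type,
    body: { query: { match: { full_name: 'Roger Sand' } } }
  }
end

# `client` is expected to be an Elasticsearch::Client, as in the question.
def create_single_shard_index(client, index = 'test')
  client.indices.create(index: index, body: single_shard_settings)
end

def dfs_search(client, search_type = 'dfs_query_and_fetch')
  client.search(dfs_search_params(search_type))
end

# Print the request payloads so they can be inspected without a running cluster.
puts JSON.generate(single_shard_settings)
puts JSON.generate(dfs_search_params)
```

With either change in place, rerunning the script with both ID ranges should give the same ordering and scores, since the statistics no longer depend on how the documents are spread over shards.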

Answer 2

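Scoring will always be unreliable with a small dataset and the default Elasticsearch index setting of 5 shards.

For tests like this, use an index with a single shard, or use a much larger dataset so that the corpus is distributed more evenly across the shards.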

Influence of the "_id" field on the way searching works in Elasticsearch?
