Question

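Using ElasticSearch v1.7.2 and a fairly large index, I get different document counts from the following two searches, both of which use the fuzzy operator inside a query_string.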

Query:

{
  "query": {
     "query_string": {
        "query": "rapt~4"
     }
  }
}

Filter:

{
 "filter": {
    "query": {
       "query_string": {
          "query": "rapt~4"
       }
    }
 }
}

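The filter returns about 5% more results than the query. Why do the document counts differ? Are there options I can specify to make them consistent?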

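Note that this inconsistency only appears once I use a medium-sized dataset. I tried inserting a few (< 10) of the documents that match the filter but not the query into a clean cluster, after which both my query and my filter matched all of the documents. However, in a cluster with a single index, a single type, and a few hundred documents, I start to see the discrepancy.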

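With the explain=true option, it appears that the query scores are computed using the Practical Scoring Function. The explanation reports the boost, queryNorm, idf, and term weights. In contrast, the filter explanation reports only the boost and queryNorm components of the Practical Scoring Function, not the idf or term weights.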

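Example hits with explanations are below. Note that I have removed many fields from my example hits and simplified the content, so the term frequencies in the explanation will not match the actual content beyond the matched term (in this case "fact"). These responses are for the same event. My problem is with the extra hits that are included in the filter response but not in the query response. Their explanations look exactly the same.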

Query:

curl -XPOST "http://localhost:9200/index-name/example-type/_search" -H "Content-Type: application/json" -d'{"query":{"query_string":{"query":"rapt~"}},"explain":true}'

Query response:

{
"_source": {
  "type": "example",
  "content": "to the fact that"
},
"_explanation": {
  "value": 0.10740301,
  "description": "sum of:",
  "details": [
    {
      "value": 0.10740301,
      "description": "weight(_all:fact^0.5 in 465) [PerFieldSimilarity], result of:",
      "details": [
        {
          "value": 0.10740301,
          "description": "score(doc=465,freq=2.0), product of:",
          "details": [
            {
              "value": 0.11091774,
              "description": "queryWeight, product of:",
              "details": [
                {
                  "value": 0.5,
                  "description": "boost"
                },
                {
                  "value": 7.303468,
                  "description": "idf(docFreq=68, maxDocs=37706)"
                },
                {
                  "value": 0.03037399,
                  "description": "queryNorm"
                }
              ]
            },
            {
              "value": 0.96831226,
              "description": "fieldWeight in 465, product of:",
              "details": [
                {
                  "value": 1.4142135,
                  "description": "tf(freq=2.0), with freq of:",
                  "details": [
                    {
                      "value": 2,
                      "description": "termFreq=2.0"
                    }
                  ]
                },
                {
                  "value": 7.303468,
                  "description": "idf(docFreq=68, maxDocs=37706)"
                },
                {
                  "value": 0.09375,
                  "description": "fieldNorm(doc=465)"
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
}
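The numbers in the query explanation above can be reproduced by hand. The sketch below assumes Lucene's classic TF/IDF similarity, where tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)); the boost, queryNorm, and fieldNorm values are copied from the explain output rather than derived.

```python
import math

def classic_idf(doc_freq: int, max_docs: int) -> float:
    # Inverse document frequency as used by Lucene's classic TF/IDF similarity.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def classic_tf(freq: float) -> float:
    # Term frequency factor: square root of the raw term frequency.
    return math.sqrt(freq)

# Values copied from the explain output above.
boost = 0.5
query_norm = 0.03037399
field_norm = 0.09375

idf = classic_idf(doc_freq=68, max_docs=37706)
tf = classic_tf(2.0)

query_weight = boost * idf * query_norm   # "queryWeight, product of:"
field_weight = tf * idf * field_norm      # "fieldWeight in 465, product of:"
score = query_weight * field_weight       # "score(doc=465,freq=2.0), product of:"

print(f"idf={idf:.6f} tf={tf:.7f}")
print(f"queryWeight={query_weight:.8f} fieldWeight={field_weight:.8f}")
print(f"score={score:.8f}")
```

Every factor is strictly positive for a matching term, so under these assumptions the score of a matching document cannot collapse to zero.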

Filter:

curl -XPOST "http://localhost:9200/index-name/example-type/_search" -H "Content-Type: application/json" -d'{"query":{"filtered":{"filter":{"fquery":{"query":{"query_string":{"query":"rapt~"}}}}}},"explain":true}'

And the filter response:

{
"_source": {
  "type": "example",
  "content": "to the fact that"
},
"_explanation": {
  "value": 1,
  "description": "ConstantScore(cache(+_type:example-type +org.elasticsearch.index.search.nested.NonNestedDocsFilter@737a6633)), product of:",
  "details": [
    {
      "value": 1,
      "description": "boost"
    },
    {
      "value": 1,
      "description": "queryNorm"
    }
  ]
}
}

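When I wrap the filter in a constant score query, I get exactly the same result set as with the filter (again, more than with the query), but the explanation looks cleaner:

Constant-score query wrapped filter: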

curl -XPOST "http://localhost:9200/index-name/example-type/_search" -H "Content-Type: application/json" -d'{"query":{"constant_score":{"filter":{"query":{"query_string":{"query":"rapt~"}}}}},"explain":true}'

And the constant-score query wrapped filter response:

{
"_source": {
  "type": "example",
  "content": "to the fact that"
},
"_explanation": {
  "value": 1,
  "description": "ConstantScore(QueryWrapperFilter(_all:rapt~2)), product of:",
  "details": [
    {
      "value": 1,
      "description": "boost"
    },
    {
      "value": 1,
      "description": "queryNorm"
    }
  ]
}
}
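The explanation above shows the fuzzy term rewritten as `_all:rapt~2`, that is, a maximum of two edits. Lucene's fuzzy matching is based on Damerau-Levenshtein edit distance; the sketch below is a plain dynamic-programming version (optimal string alignment), not Lucene's automaton-based implementation, and only illustrates why "fact" is within range of "rapt".

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance allowing insert, delete, substitute, and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ap" <-> "pa".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

# "rapt" -> "fact" needs two substitutions (r -> f, p -> c).
print(osa_distance("rapt", "fact"))
```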

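Because the filter returns more results than the query, my guess is that the Practical Scoring Function ends up giving some matching documents a score of 0. However, for documents that "match" the query, none of the components of the scoring function should be zero.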

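Edit: I have recreated the problem on a small cluster of 238 documents (note that the content of the documents is generated from an ngram language model trained on Wikipedia text). I have posted the cluster and the json events on Dropbox. To see the problem with this data, run the following query, which returns the event with id=138: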

{
 "explain": true,
 "query": {
    "bool": {
       "must_not": [
          {
             "query_string": {
                "query": "rap~",
                "fields": [
                   "body"
                ]
             }
          }
       ],
       "must": [
          {
             "constant_score": {
                "filter": {
                   "query": {
                      "query_string": {
                         "query": "rap~",
                         "fields": [
                            "body"
                         ]
                      }
                   }
                }
             }
          }
       ]
    }
 }
}
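An alternative to the must/must_not trick above is to run both searches separately and diff the returned ids on the client. This is a minimal sketch: `hit_ids` and `only_in_filter` are hypothetical helper names, the two response dicts are made-up samples in the usual `hits.hits[]._id` shape, and it assumes each search was issued with a `size` large enough to return every hit.

```python
def hit_ids(response: dict) -> set:
    # Collect the _id of every hit in a search response body.
    return {hit["_id"] for hit in response["hits"]["hits"]}

def only_in_filter(query_response: dict, filter_response: dict) -> list:
    # Ids returned by the filter search but missing from the query search.
    return sorted(hit_ids(filter_response) - hit_ids(query_response))

# Made-up sample responses, trimmed to the fields the helpers read.
query_response = {"hits": {"hits": [{"_id": "12"}, {"_id": "57"}]}}
filter_response = {"hits": {"hits": [{"_id": "12"}, {"_id": "57"}, {"_id": "138"}]}}

print(only_in_filter(query_response, filter_response))
```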

Answer 1

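In Elasticsearch versions before 5.x, a top-level filter means post_filter. Post filters are generally only relevant when you are using aggregations.

Starting with Elasticsearch 5.0, you have to say post_filter explicitly, which avoids this confusion.

So the difference is that your top-level query actually restricts the results to the set of matching documents. The post filter effectively matches everything and then only removes results from the hits, without affecting aggregation counts.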
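The ordering described here (query first, aggregations over what the query matched, post filter last) can be pictured with a toy in-memory model. This is not Elasticsearch code: `toy_search`, the sample documents, and the field names are all hypothetical and exist only to show where each step applies.

```python
from collections import Counter

def toy_search(docs, query, post_filter=None, agg_field=None):
    # Step 1: the query decides which documents are in scope.
    matched = [d for d in docs if query(d)]
    # Step 2: aggregations are computed over everything the query matched.
    aggs = dict(Counter(d[agg_field] for d in matched)) if agg_field else {}
    # Step 3: the post filter only trims the hits that are returned.
    hits = [d for d in matched if post_filter is None or post_filter(d)]
    return {"hits": hits, "aggs": aggs}

docs = [
    {"id": 1, "color": "red", "body": "fact"},
    {"id": 2, "color": "blue", "body": "fact"},
    {"id": 3, "color": "red", "body": "other"},
]

result = toy_search(
    docs,
    query=lambda d: d["body"] == "fact",
    post_filter=lambda d: d["color"] == "red",
    agg_field="color",
)
print([d["id"] for d in result["hits"]], result["aggs"])
```

With no query at all, step 1 keeps every document, which is the sense in which a request containing only a top-level filter "matches everything" before the filter is applied.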

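"...it appears that the query scores are computed using..."

Queries always compute a score; they are meant to help order items properly by their relevance (score). Filters never compute a score; filters are for pure boolean logic and have no effect on "relevance" beyond inclusion/exclusion.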

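To be fair, you can turn any query into a filter in several ways in Elasticsearch 1.x (and in 2.x, every query is also a filter in the right context!), but I tend to use fquery. If you do that, you should get the same results:

As a query:

{ "query": { "query_string": { "query": "rapt~" } } }

As a filter:

{ "query": { "filtered": { "filter": { "fquery": { "query": { "query_string": { "query": "rapt~" } } } } } } }

In ES 2.x, the filter is simplified as well (and the query stays the same):

{ "query": { "bool": { "filter": { "query_string": { "query": "rapt~" } } } } }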
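For anyone generating these request bodies programmatically, the three shapes can be built from one query string. This is a sketch only: the function names are hypothetical, the 1.x form uses the `filtered`/`fquery` wrapping named in this answer, and the 2.x form assumes the `bool` query's `filter` clause.

```python
import json

def as_query(text: str) -> dict:
    # Scoring query: same shape in 1.x and 2.x.
    return {"query": {"query_string": {"query": text}}}

def as_filter_1x(text: str) -> dict:
    # Elasticsearch 1.x: wrap the query in an fquery filter inside a filtered query.
    inner = {"query_string": {"query": text}}
    return {"query": {"filtered": {"filter": {"fquery": {"query": inner}}}}}

def as_filter_2x(text: str) -> dict:
    # Elasticsearch 2.x (assumed form): a query placed in a bool filter clause.
    return {"query": {"bool": {"filter": {"query_string": {"query": text}}}}}

for build in (as_query, as_filter_1x, as_filter_2x):
    print(build.__name__, json.dumps(build("rapt~")))
```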

Elasticsearch query and filter give different doc counts when using the Lucene fuzzy operator