Question

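I'm trying to build auto-suggest based on document titles. If the user types "South", the auto-suggest should offer "South Korea", for example. I use a shingle filter to split the titles into two-word shingles. Here is my mapping: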

{
   "settings":{
      "analysis":{
         "filter":{
            "suggestions_shingle":{
               "type":"shingle",
               "min_shingle_size":2,
               "max_shingle_size":2
            }
         },
         "analyzer":{
            "suggestions":{
               "tokenizer":"standard",
               "filter":[
                  "suggestions_shingle"
               ]
            }
         }
      }
   },
   "mappings":{
      "docs":{
         "properties":{
            "docs_title":{
               "type":"multi_field",
               "fields":{
                  "docs_title":{
                     "type":"string"
                  },
                  "suggestions":{
                     "type":"string",
                     "analyzer":"suggestions",
                     "search_analyzer":"simple"
                  }
               }
            }
         }
      }
   }
}

Here is my query:

{
   "explain":true,
   "aggs":{
      "description_suggestions":{
         "terms":{
            "field":"docs_title.suggestions",
            "size":10,
            "include":"South .*"
         }
      }
   },
   "size":0
}

Here is the response to the query:

{
    "took": 2764,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 453526,
        "max_score": 0,
        "hits": []
    },
    "aggregations": {
        "description_suggestions": {
            "doc_count_error_upper_bound": 10,
            "sum_other_doc_count": 2363,
            "buckets": [
                {
                    "key": "South Korea",
                    "doc_count": 274
                },
                {
                    "key": "South India",
                    "doc_count": 179
                },
                {
                    "key": "South Carolina",
                    "doc_count": 179
                }
            ]
        }
    }
}

As you can see, the query took 2764 ms. How can I speed it up?

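I was thinking of running the aggregation only on the latest 2000 documents, using a filter, to speed it up. I noticed that Elasticsearch seems to ignore the filter and runs the aggs on all documents. Here is the query: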

{
   "explain":true,
   "aggs":{
      "recent_suggestions":{
         "filter":{
            "range":{
               "docs_date":{
                  "gte":1453886958
               }
            }
         },
         "aggs":{
            "description_suggestions":{
               "terms":{
                  "field":"docs_title.suggestions",
                  "size":10,
                  "include":"South .*"
               }
            }
         }
      }
   },
   "size":0
}

Here is the response:

{
    "took": 2216,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 453526,
        "max_score": 0,
        "hits": []
    },
    "aggregations": {
        "recent_suggestions": {
            "doc_count": 27240,
            "description_suggestions": {
                "doc_count_error_upper_bound": 0,
                "sum_other_doc_count": 173,
                "buckets": [
                {
                    "key": "South Korea",
                    "doc_count": 19
                },
                {
                    "key": "South India",
                    "doc_count": 17
                },
                {
                    "key": "South Carolina",
                    "doc_count": 17
                }
                ]
            }
        }
    }
}

As you can see, the total hit count is the same.

How can I make these two queries faster?

I'm using AWS Elasticsearch v1.5.2 with Lucene v4.10.4 on a single instance.

Answer 1

The problem here is that a very expensive aggregation is being run over all documents, which is why it takes so long.

1) First query:

{
  "query": {
    "match": {
      "docs_title": "south"
    }
  },
  "aggs": {
    "unique": {
      "terms": {
        "field": "docs_title.suggestions",
        "size": 10,
        "include": "(?i)south .*",
        "execution_hint": "map"
      }
    }
  },
  "size": 0
}

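We only consider documents that contain "south" for the aggregation. You didn't specify any query, so by default it is a match_all query. I also added the case-insensitive flag (?i) to the include pattern so that it matches both "South Korea" and "south korea".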
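To illustrate what the (?i) flag changes, here is a minimal sketch that uses Python's re module as a stand-in for the regex engine behind include. It assumes the pattern has to match the whole term; the sample terms and the accepted helper are made up for illustration.

```python
import re

# Stand-in for the aggregation's include pattern. Python's re is used here
# only as an analogue of the regex engine that Elasticsearch applies.
INCLUDE = re.compile(r"(?i)south .*")


def accepted(term: str) -> bool:
    # Assumption: the pattern must match the whole term, hence fullmatch.
    return INCLUDE.fullmatch(term) is not None


# Hypothetical shingle terms, not taken from the real index.
terms = ["South Korea", "south korea", "North Korea", "Deep South"]
print([t for t in terms if accepted(t)])
```

Without the flag, only the capitalized form would pass, so lowercased shingles would be silently dropped from the buckets.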

2) Second query:

We need to narrow down the set of documents that the aggregation has to consider.

{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "docs_title": "south"
        }
      },
      "filter": {
        "range": {
          "docs_date": {
            "gte": 1453886958
          }
        }
      }
    }
  },
  "aggs": {
    "unique": {
      "terms": {
        "field": "docs_title.suggestions",
        "size": 10,
        "include": "(?i)south .*",
        "execution_hint": "map"
      }
    }
  },
  "size": 0
}

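In this case, the filtering for recent documents should be done inside the query rather than inside the aggregation.

You should now see a considerable difference. Previously the aggregation ran over 450K documents; now the set should be much smaller.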

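EDIT 1: This issue gives details on why include/exclude is costly on high-cardinality fields, which docs_title.suggestions is (shingles increase the cardinality even more). @markharwood commented on the issue:

The root cause is that the IncludeExclude.acceptedGlobalOrdinals() method eagerly enumerates all terms in the index rather than lazily enumerating only those in the result set. For high-cardinality fields this can take a long time.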

Basically, the aggregation walks through all the terms in the index. The solution is to use "execution_hint": "map" in the aggregation, which avoids loading global ordinals. There is no 100% guarantee either. From the docs:

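Please note that Elasticsearch will ignore this execution hint if it is not applicable, and that there is no backward compatibility guarantee on these hints.

map should only be considered when very few documents match the query.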
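The difference between the two strategies can be sketched with a toy model. This is not Elasticsearch's implementation: the shingles helper, the sample titles, and both functions are hypothetical, and unigrams are left out for brevity. The eager path has to test every distinct term in the index, while a map-style path only looks at the terms of the documents that matched the query.

```python
import re


def shingles(title: str) -> list[str]:
    """Two-word shingles of a title (unigrams omitted for brevity)."""
    words = title.split()
    return [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]


def eager_regex_checks(all_docs: list[str]) -> int:
    """Toy model of the eager path: one regex test per distinct index term."""
    return len({s for title in all_docs for s in shingles(title)})


def map_style_counts(matching_docs: list[str], pattern: str) -> dict[str, int]:
    """Toy model of the map-style path: only terms of matching docs are seen."""
    regex = re.compile(pattern)
    counts: dict[str, int] = {}
    for title in matching_docs:
        for term in set(shingles(title)):  # count each doc once per term
            if regex.fullmatch(term):
                counts[term] = counts.get(term, 0) + 1
    return counts


# Hypothetical titles standing in for the index.
docs = [
    "South Korea travel guide",
    "Weather in South Korea",
    "South Carolina road trip",
    "Cooking pasta at home",
    "Learning to play chess",
]
# Stand-in for the match query on "south".
matching = [d for d in docs if "south" in d.lower().split()]

print(eager_regex_checks(docs))
print(map_style_counts(matching, r"(?i)south .*"))
```

In this model the eager cost grows with the number of distinct shingles in the whole index, whereas the map-style cost grows with the number of matching documents, which is why the hint only pays off when the query is selective.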

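Note: this may be completely unrelated, but you might want to look at the completion suggester, although it only works when the string starts with the typed letters.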
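The prefix-only limitation can be pictured with a small sketch. It is a toy model of prefix lookup over a sorted list, not how the completion suggester is actually implemented, and the prefix_suggest helper and sample titles are hypothetical.

```python
from bisect import bisect_left


def prefix_suggest(sorted_titles: list[str], prefix: str, size: int = 10) -> list[str]:
    """Return up to `size` titles that start with `prefix`, ignoring case."""
    keys = [t.lower() for t in sorted_titles]  # input must be sorted by lowercase
    p = prefix.lower()
    out: list[str] = []
    # Jump to the first candidate, then scan while the prefix still matches.
    for i in range(bisect_left(keys, p), len(keys)):
        if not keys[i].startswith(p) or len(out) == size:
            break
        out.append(sorted_titles[i])
    return out


# Hypothetical titles, sorted case-insensitively.
titles = sorted(
    ["South Korea", "South India", "Visit South Carolina", "Southampton"],
    key=str.lower,
)
print(prefix_suggest(titles, "south"))
```

A title such as "Visit South Carolina" is never returned for "south" in this model, because the typed text is not a prefix of the whole string. The shingle-based approach does not have that restriction.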
