Question

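We are running a single ES node on version 2.3.1.

We have a query that basically returns the number of unique users (unique values of the user ID field) in daily buckets over the last week (168 hours). This query hits 8 indices.

In the past, this query ran quickly. It has become slower over time, and now we are getting rejections and cannot figure out why. We found that when we run this query, search.queue fills up immediately: it goes to about 350, then 640, then 1000, and then the rejections start (these steps happen within a few seconds while the query is running).

I don't understand how this is possible, since the query should only hit 8 indices with 2 shards each, and it used to work fine.

The query is: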

GET /abcdefg-2016.09.05%2Cabcdefg-2016.09.06%2Cabcdefg-2016.09.07%2Cabcdefg-2016.09.08%2Cabcdefg-2016.09.09%2Cabcdefg-2016.09.10%2Cabcdefg-2016.09.11%2Cabcdefg-2016.09.12/abcdefg/_search
{
  "sort": {},
  "from": 0,
  "size": 0,
  "fields": [
    "*",
    "_source",
    "_field_names"
  ],
  "fielddata_fields": [
    "@timestamp"
  ],
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "@timestamp": {
                  "gte": 1473073782735,
                  "lte": 1473678582735
                }
              }
            },
            {
              "missing": {
                "field": "demo"
              }
            }
          ],
          "must_not": [],
          "should": []
        }
      }
    }
  },
  "aggs": {
    "date_histogram": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "1d",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": 1473073782735,
          "max": 1473678582735
        }
      },
      "aggs": {
        "unique_users_count": {
          "cardinality": {
            "field": "usedUID"
          }
        }
      }
    },
    "unique_users_count": {
      "cardinality": {
        "field": "usedUID"
      }
    }
  }
}
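
The numbers in the request can be sanity-checked. The sketch below converts the two epoch-millisecond bounds from the range filter to UTC, derives the daily index names they span, and multiplies by the 2 shards per index mentioned above. It assumes one shard-level search task per shard per request; the helper names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Bounds copied from the range filter in the query above.
GTE_MS = 1473073782735
LTE_MS = 1473678582735
SHARDS_PER_INDEX = 2  # stated in the question

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_utc(ms):
    """Convert epoch milliseconds to an aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


def daily_indices(prefix, start, end):
    """List the daily index names (prefix-YYYY.MM.DD) touched by [start, end]."""
    names = []
    day = start.date()
    while day <= end.date():
        names.append(f"{prefix}-{day:%Y.%m.%d}")
        day += timedelta(days=1)
    return names


start, end = to_utc(GTE_MS), to_utc(LTE_MS)
window_hours = (LTE_MS - GTE_MS) / 3_600_000
indices = daily_indices("abcdefg", start, end)
shard_requests = len(indices) * SHARDS_PER_INDEX

print(start.isoformat(), "->", end.isoformat())
print(window_hours, "hours,", len(indices), "indices,", shard_requests, "shard requests")
```

Under that assumption a single execution of this query accounts for 16 shard-level tasks, which suggests that most of the 1000 queued entries come from other requests piling up while the search threads are busy.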

Running curl localhost:9200/_cat/thread_pool?v while the query is stuck shows:

host      ip        bulk.active bulk.queue bulk.rejected index.active index.queue index.rejected search.active search.queue search.rejected
127.0.0.1 127.0.0.1           0          0             0            0           0              0             4         1000          132228

It stays in this state for a few minutes, and then the queue goes back to zero.
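
To watch how the queue evolves while reproducing the problem, the _cat/thread_pool table can be parsed into numbers. This is a minimal sketch that works on the text pasted above; in practice the text would come from the endpoint itself, and the function names here are hypothetical.

```python
# Output copied from the question; normally fetched from /_cat/thread_pool?v.
CAT_OUTPUT = """\
host      ip        bulk.active bulk.queue bulk.rejected index.active index.queue index.rejected search.active search.queue search.rejected
127.0.0.1 127.0.0.1           0          0             0            0           0              0             4         1000          132228
"""


def parse_cat_table(text):
    """Parse whitespace-separated _cat output (with ?v header) into dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def search_pool_stats(row):
    """Extract the search thread pool counters as integers."""
    keys = ("search.active", "search.queue", "search.rejected")
    return {key: int(row[key]) for key in keys}


rows = parse_cat_table(CAT_OUTPUT)
stats = search_pool_stats(rows[0])
print(stats)
```

Polling this every second and comparing successive search.rejected values shows how many requests are being rejected per interval, since that counter only ever grows as far as I know.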

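What could be the problem?

EDIT: adding profile: true gives this output: http://pastebin.com/s4jpw36d

EDIT2: The strangest thing is that in the profile output I see ES sending a lot of these weird queries to Lucene: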

  {
    "query_type": "BooleanQuery",
    "lucene": "@timestamp:0 \u0000\u0000\n[xD @timestamp:0 \u0000\u0000\n[xE @timestamp:0 \u0000\u0000\n[xF @timestamp:0 \u0000\u0000\n[xG @timestamp:0 \u0000\u0000\n[xH @timestamp:0 \u0000\u0000\n[xI @timestamp:0 \u0000\u0000\n[xJ @timestamp:0 \u0000\u0000\n[xK @timestamp:0 \u0000\u0000\n[xL @timestamp:0 \u0000\u0000\n[xM @timestamp:0 \u0000\u0000\n[xN @timestamp:0 \u0000\u0000\n[xO @timestamp:0 \u0000\u0000\n[xP @timestamp:0 \u0000\u0000\n[xQ",
    "time": "61.40814100ms",
    "breakdown": {
      "score": 0,
      "create_weight": 357521,
      "next_doc": 40988029,
      "match": 0,
      "build_scorer": 2733654,
      "advance": 0
    },
    "children": [
      {
        "query_type": "TermQuery",
        "lucene": "@timestamp:0 \u0000\u0000\n[xD",
        "time": "0.1429700000ms",
        "breakdown": {
          "score": 0,
          "create_weight": 21940,
          "next_doc": 99164,
          "match": 0,
          "build_scorer": 21866,
          "advance": 0
        }
      },
      {
        "query_type": "TermQuery",
        "lucene": "@timestamp:0 \u0000\u0000\n[xE",
        "time": "0.5797620000ms",
        "breakdown": {
          "score": 0,
          "create_weight": 64810,
          "next_doc": 501767,
          "match": 0,
          "build_scorer": 13185,
          "advance": 0
        }
      }, 
      ...
    ]
  }

EDIT3: OK, this appears to be intentional: https://discuss.elastic.co/t/es-rewriting-range-to-timestamp-to-booleanquery-termquery-why/56363. But it makes no sense to me: it now renders the query unusable and clogs a queue that is otherwise empty...

Answer 1

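Re: EDIT3. I have a hunch that there is a bug in the optimization ("rewriting range...").

When deciding whether to rewrite the range query as a boolean query, the switching code looks at the TermsEnum (MultiTermQueryConstantScoreWrapper.java:147). If TermsEnum.next() returns null (which I believe it does when the field has no term vectors), then the collectTerms method returns true (and the query is rewritten as a boolean query... even though there are no term vectors!).

By pulling @timestamp from the field cache in your query, you are doing something non-standard: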
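
The decision described above can be modelled in a few lines. This is a simplified Python model of that behaviour, not Lucene's code: the function name mirrors collectTerms only for readability, and the threshold of 16 is an assumption based on my recollection of the Lucene 5.x sources.

```python
# Assumed value of Lucene's boolean-rewrite term threshold (not confirmed here).
BOOLEAN_REWRITE_TERM_COUNT_THRESHOLD = 16


def collect_terms(terms_enum, threshold=BOOLEAN_REWRITE_TERM_COUNT_THRESHOLD):
    """Model of the rewrite decision.

    Pulls up to `threshold` terms from the iterator. Returns a pair
    (rewrite_as_boolean, collected_terms). The first element is True when
    the iterator is exhausted within the threshold, including the case
    where it yields no terms at all.
    """
    collected = []
    for _ in range(threshold):
        term = next(terms_enum, None)
        if term is None:
            # Enum exhausted early: few (or zero) terms, so boolean rewrite.
            return True, collected
        collected.append(term)
    # Threshold reached: rewrite only if nothing is left in the enum.
    return next(terms_enum, None) is None, collected


print(collect_terms(iter([])))                      # empty enum
print(collect_terms(iter(range(14)))[0])            # 14 terms
print(collect_terms(iter(range(17)))[0])            # 17 terms
```

In this model an empty enum also yields True, which is the case the hunch above is about. The BooleanQuery shown in EDIT2 has 14 clauses, which would sit under the assumed threshold.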

  "fielddata_fields": [
    "@timestamp"
  ],

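The fact that you are using the fielddata workaround suggests that you may not be storing term information for the timestamp field (why, anyway?!), which is what the range optimization expects. Instead, you are passing a reference to a field that is rebuilt at query time from the field cache (which is probably not covered by the existing tests).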

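As a workaround and a general tuning improvement, I would make sure that you have not disabled doc values for the timestamp field (in the index mapping), and then reference your timestamp field directly in the aggregation ({ {1}} vs timestamp). If you have explicitly disabled doc values for @timestamp in the mapping, you will have to either reindex the old data or wait until the mapping change has rolled forward far enough for the query to succeed (all clean indices).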
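
Checking the mapping for this can be scripted. The sketch below walks a GET _mapping response and reports where doc values are explicitly turned off for a field. The sample response is invented for illustration (the real one would come from the cluster), and a missing doc_values key is treated as "enabled", on the assumption that the default is simply omitted from the output.

```python
def doc_values_disabled(mapping_response, field):
    """Map (index, type) -> True if `field` has doc_values explicitly false."""
    result = {}
    for index, body in mapping_response.items():
        for type_name, type_map in body.get("mappings", {}).items():
            props = type_map.get("properties", {})
            if field in props:
                # Absent key is treated as the default (enabled).
                result[(index, type_name)] = props[field].get("doc_values", True) is False
    return result


# Hypothetical response shaped like GET /abcdefg-*/_mapping on ES 2.x.
sample_response = {
    "abcdefg-2016.09.05": {
        "mappings": {
            "abcdefg": {
                "properties": {
                    "@timestamp": {"type": "date", "doc_values": False},
                    "usedUID": {"type": "string", "index": "not_analyzed"},
                }
            }
        }
    },
    "abcdefg-2016.09.12": {
        "mappings": {
            "abcdefg": {"properties": {"@timestamp": {"type": "date"}}}
        }
    },
}

report = doc_values_disabled(sample_response, "@timestamp")
for (index, type_name), disabled in sorted(report.items()):
    print(index, type_name, "doc_values disabled" if disabled else "doc_values enabled")
```

Any index reported as disabled here would be one of those that needs reindexing or has to age out of the query window.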

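Doc values are currently the best practice for aggregations. For field types that support them, they are enabled by default in ES 2.0+, and they spare you many performance problems with respect to aggregations (and possibly any unexpected "optimization" headaches as well!).

Here is a good article that discusses the scaling problems introduced by the field cache and why you should use doc values: https://www.elastic.co/blog/support-in-the-wild-my-biggest-elasticsearch-problem-at-scale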

ElasticSearch query on only 8 indices fills up search.queue (capacity 1000)
