Question

I am indexing metric names into Elasticsearch. Metric names are of the form foo.bar.baz.aux. Here are the index settings and mapping I use.

{
    "index": {
        "analysis": {
            "analyzer": {
                "prefix-test-analyzer": {
                    "filter": "dotted",
                    "tokenizer": "prefix-test-tokenizer",
                    "type": "custom"
                }
            },
            "filter": {
                "dotted": {
                    "patterns": [
                        "([^.]+)"
                    ],
                    "type": "pattern_capture"
                }
            },
            "tokenizer": {
                "prefix-test-tokenizer": {
                    "delimiter": ".",
                    "type": "path_hierarchy"
                }
            }
        }
    }
}

{
    "metrics": {
        "_routing": {
            "required": true
        },
        "properties": {
            "tenantId": {
                "type": "string",
                "index": "not_analyzed"
            },
            "unit": {
                "type": "string",
                "index": "not_analyzed"
            },
            "metric_name": {
                "index_analyzer": "prefix-test-analyzer",
                "search_analyzer": "keyword",
                "type": "string"
            }
        }
    }
}

For the metric name foo.bar.baz, the index above creates the following terms:

foo
bar
baz
foo.bar
foo.bar.baz
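
To make the term list above easier to reason about, here is a rough Python emulation of the analysis chain. It is only a sketch under the assumption that path_hierarchy emits one token per dotted prefix and that pattern_capture keeps the original token and adds every dot-free run; the helper name emulate_terms is hypothetical and this is not Elasticsearch's own code.

```python
import re


def emulate_terms(metric_name):
    """Approximate the unique terms the analyzer above emits for one name."""
    parts = metric_name.split(".")
    # path_hierarchy tokenizer: one token per prefix (foo, foo.bar, foo.bar.baz)
    prefixes = [".".join(parts[:i]) for i in range(1, len(parts) + 1)]
    terms = set()
    for token in prefixes:
        terms.add(token)  # the original token is kept alongside the captures
        terms.update(re.findall(r"[^.]+", token))  # the "([^.]+)" captures
    return terms


print(sorted(emulate_terms("foo.bar.baz")))
# ['bar', 'baz', 'foo', 'foo.bar', 'foo.bar.baz']
```

Note that single segments such as bar and baz end up as terms of their own, regardless of the level they came from.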

Suppose I have a large number of metrics like the ones below:

a.b.c.d.e
a.b.c.d
a.b.m.n
x.y.z

I have to write a query that returns the tokens at the nth level. In the example above:

for level = 0, I should get [a, x] 
for level = 1, with 'a' as first token I should get [b]
               with 'x' as first token I should get [y]  
for level = 2, with 'a.b' as first token I should get [c, m]
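
To pin down the expected result, the same lookup can be written in plain Python over an in-memory list of names. This is only a reference for what the query should return, not a replacement for it; the function name next_level_tokens is made up for illustration.

```python
def next_level_tokens(metric_names, prefix=""):
    """Return the sorted distinct tokens that directly follow `prefix`."""
    prefix_parts = prefix.split(".") if prefix else []
    level = len(prefix_parts)
    found = set()
    for name in metric_names:
        parts = name.split(".")
        # Keep names that start with the prefix and have one more segment.
        if parts[:level] == prefix_parts and len(parts) > level:
            found.add(parts[level])
    return sorted(found)


metrics = ["a.b.c.d.e", "a.b.c.d", "a.b.m.n", "x.y.z"]
print(next_level_tokens(metrics))         # ['a', 'x']
print(next_level_tokens(metrics, "a"))    # ['b']
print(next_level_tokens(metrics, "x"))    # ['y']
print(next_level_tokens(metrics, "a.b"))  # ['c', 'm']
```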

I could not think of any way to do this other than a terms aggregation. To find the level 2 tokens under a.b, this is the query I came up with.

time curl -XGET http://localhost:9200/metrics_alias/metrics/_search\?pretty\&routing\=12345 -d '{
      "size": 0,
      "query": {
        "term": {
            "tenantId": "12345"
        }
      },
      "aggs": {
          "metric_name_tokens": {
              "terms": {
                  "field" : "metric_name",
                  "include": "a[.]b[.][^.]*",
                  "execution_hint": "map",
                  "size": 0
              }
          }
      }
  }'
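
For reference, the request body above can be generated for any non-empty prefix. The following is a minimal sketch that assumes the segments contain no regex-special characters other than the dots being escaped; build_level_query is a hypothetical helper name. Level 0 is left out on purpose because, with the analyzer above, single-segment terms from every level appear to be indexed as well, so a bare "[^.]*" would likely match more than the first-level tokens.

```python
import json


def build_level_query(tenant_id, prefix):
    """Build the terms-aggregation body for the tokens directly under `prefix`."""
    if not prefix:
        raise ValueError("prefix must be non-empty, e.g. 'a' or 'a.b'")
    # "[.]" matches a literal dot; "[^.]*" matches exactly one more segment.
    include = "".join(segment + "[.]" for segment in prefix.split(".")) + "[^.]*"
    return {
        "size": 0,
        "query": {"term": {"tenantId": tenant_id}},
        "aggs": {
            "metric_name_tokens": {
                "terms": {
                    "field": "metric_name",
                    "include": include,
                    "execution_hint": "map",
                    "size": 0,
                }
            }
        },
    }


print(json.dumps(build_level_query("12345", "a.b"), indent=2))
```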

This results in the following buckets. I parse the output and grab [c, m] from there.

"buckets" : [ {
     "key" : "a.b.c",
     "doc_count" : 2
   }, {
     "key" : "a.b.m",
     "doc_count" : 1
 } ]
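
The parsing step is roughly the following. It is a sketch that assumes the response has already been decoded from JSON into a dict and that the buckets sit under aggregations.metric_name_tokens, as named in the query; the function name is hypothetical.

```python
def tokens_from_response(response, agg_name="metric_name_tokens"):
    """Return the last dotted segment of every bucket key."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [bucket["key"].rsplit(".", 1)[-1] for bucket in buckets]


response = {
    "aggregations": {
        "metric_name_tokens": {
            "buckets": [
                {"key": "a.b.c", "doc_count": 2},
                {"key": "a.b.m", "doc_count": 1},
            ]
        }
    }
}
print(tokens_from_response(response))  # ['c', 'm']
```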

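So far so good. The query works well for most tenants (note the tenantId term query above). For certain tenants with a large amount of data (around 1 million), performance is really slow. My guess is that the terms aggregation itself is what takes the time.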

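I am wondering whether a terms aggregation is the right choice for this kind of data, and I am also looking for other kinds of queries that might work.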

Answer 1

Some suggestions:

"Mirror" the aggregation-level filter in the query part as well. So, to match a.b., use the following as the query and keep the same aggs part:

"bool": {
  "must": [
    {
      "term": {
        "tenantId": 123
      }
    },
    {
      "prefix": {
        "metric_name": {
          "value": "a.b."
        }
      }
    }
  ]
}

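Or even use a regexp query with the same regular expression as in the aggregation part. This way the aggregation has to evaluate fewer buckets, because fewer documents reach the aggregation part. You mentioned that regexp worked better for you; my initial guess was that prefix would perform better.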
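
Put together, the mirrored request could be assembled like this. It is a sketch only: build_mirrored_query and its use_regexp flag are hypothetical names, the aggs part is copied from the question, and segments are assumed to contain no regex-special characters.

```python
def build_mirrored_query(tenant_id, prefix, use_regexp=False):
    """Filter documents in the query with the same constraint the aggregation uses."""
    include = "".join(segment + "[.]" for segment in prefix.split(".")) + "[^.]*"
    if use_regexp:
        # Same regular expression as the aggregation's "include".
        name_clause = {"regexp": {"metric_name": include}}
    else:
        name_clause = {"prefix": {"metric_name": {"value": prefix + "."}}}
    return {
        "size": 0,
        "query": {
            "bool": {
                "must": [{"term": {"tenantId": tenant_id}}, name_clause]
            }
        },
        "aggs": {
            "metric_name_tokens": {
                "terms": {
                    "field": "metric_name",
                    "include": include,
                    "execution_hint": "map",
                    "size": 0,
                }
            }
        },
    }


body = build_mirrored_query("12345", "a.b")
print(body["query"]["bool"]["must"][1])
```

Which of the two clauses is faster seems to depend on the data, so it is worth timing both.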

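Change "size": 0 in the aggregation to "size": 100. After testing, you mentioned that this made no difference.
Remove "execution_hint": "map" and let Elasticsearch use the default. After testing, you mentioned that the default execution_hint performed much worse.
The only other thing I can think of is to relieve the pressure at search time by moving the work to index time. What I mean is: at index time, in your own application or whatever indexing method you use, split the text programmatically (outside ES) and index each element of the hierarchy in a separate field. For example, a.b goes in field2, a.b.c in field3, and so on, all in the same document. Then, at search time, you look at a specific field depending on what the search text is. This idea does require some extra work outside ES, though.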
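
A rough sketch of that last idea follows. The field names field1, field2, ... follow the example above, both helper names are hypothetical, and the query builder assumes those fields are mapped as not_analyzed strings so that each holds exactly one term.

```python
def add_level_fields(doc, name_field="metric_name"):
    """Return a copy of `doc` with one extra field per hierarchy prefix."""
    parts = doc[name_field].split(".")
    enriched = dict(doc)
    for depth in range(1, len(parts) + 1):
        # field1 = "a", field2 = "a.b", field3 = "a.b.c", ...
        enriched["field%d" % depth] = ".".join(parts[:depth])
    return enriched


def build_field_query(tenant_id, prefix):
    """Aggregate on the field one level below `prefix`, with no regex involved."""
    depth = len(prefix.split("."))
    return {
        "size": 0,
        "query": {
            "bool": {
                "must": [
                    {"term": {"tenantId": tenant_id}},
                    {"term": {"field%d" % depth: prefix}},
                ]
            }
        },
        "aggs": {
            "metric_name_tokens": {
                "terms": {"field": "field%d" % (depth + 1), "size": 0}
            }
        },
    }


print(add_level_fields({"tenantId": "12345", "metric_name": "a.b.c"}))
print(build_field_query("12345", "a.b")["aggs"])
```

With this layout, the level 2 lookup under a.b becomes a term filter on field2 plus a terms aggregation on field3, and documents that have no field3 simply contribute no bucket.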

Of all the suggestions above, the first one had the biggest impact: query response time improved from 23 seconds to 11 seconds.
