Accuracy of sub-aggregations in a terms aggregation

Time: 2020-05-25 09:08:24

Tags: elasticsearch query-performance elasticsearch-aggregation

I have one daily statistics record per mall, with the following fields:

  • cpnTotalCount
  • orderTotalCount
  • orderTime
  • mallId
  • cpnTotalAmount

Using two of these fields, I use bucket_script to compute the ratio cpnTotalCount / orderTotalCount, and bucket_sort to get the topK malls.

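However, if I select only 7 days of data to get the topK malls, I cannot get an accurate result because of doc_count_error_upper_bound. The terms aggregation documentation explains:

  The document counts (and the results of any sub aggregations) in the terms aggregation are not always accurate. Each shard provides its own view of what the ordered list of terms should be. These views are combined to give a final view.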
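To make the quoted behaviour concrete, here is a minimal Python sketch of how per-shard truncation can distort merged counts. It is a toy model, not Elasticsearch's implementation: the shard contents, the helper names `shard_top_terms` and `merge`, and the tiny `shard_size` are all assumptions chosen for illustration.

```python
from collections import Counter

def shard_top_terms(docs, shard_size):
    """Return one shard's top `shard_size` terms by doc count (ties broken by key)."""
    counts = Counter(docs)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:shard_size])

def merge(shard_views, size):
    """Sum the per-shard views and keep the top `size` terms."""
    merged = Counter()
    for view in shard_views:
        merged.update(view)  # adds counts term by term
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:size]

# Hypothetical data: each list holds the mallId of every document on one shard.
shard_a = ["m1"] * 5 + ["m2"] * 4 + ["m3"] * 3
shard_b = ["m3"] * 5 + ["m1"] * 4 + ["m2"] * 1
shard_c = ["m2"] * 5 + ["m3"] * 4 + ["m1"] * 1
shards = [shard_a, shard_b, shard_c]

# shard_size=2 drops one mall per shard; shard_size=3 keeps every mall.
approx = merge([shard_top_terms(s, shard_size=2) for s in shards], size=3)
exact = merge([shard_top_terms(s, shard_size=3) for s in shards], size=3)

print("approx:", approx)
print("exact: ", exact)
```

In this toy data every mall is undercounted when `shard_size` is 2, and the true leader (m3) no longer ranks first. Sums computed per bucket would be truncated in the same way, since a shard that drops a term also drops that term's sub-aggregation values.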

Is there another way to strike a better balance between accuracy and performance?

Any help would be appreciated ;)


{
  "size": 10,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "orderTime": {
              "from": 1589385600000,
              "to": 1590249599999,
              "include_lower": true,
              "include_upper": true,
              "boost": 1.0
            }
          }
        },
        {
          "range": {
            "cpnTotalCount": {
              "from": 3,
              "to": null,
              "include_lower": true,
              "include_upper": true,
              "boost": 1.0
            }
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1.0
    }
  },
  "aggs": {
    "es_aggs_bucketing": {
      "terms": {
        "field": "mallId",
        "size": 20,
        "shard_size": 10000,
        "min_doc_count": 1,
        "shard_min_doc_count": 0,
        "show_term_doc_count_error": false,
        "order": [
          {
            "_count": "desc"
          },
          {
            "_key": "asc"
          }
        ]
      },
      "aggregations": {
        "es_aggs_count_one": {
          "sum": {
            "field": "cpnTotalCount"
          }
        },
        "es_aggs_count_two": {
          "sum": {
            "field": "orderTotalCount"
          }
        },
        "es_aggs_sum_one": {
          "sum": {
            "field": "cpnTotalAmount"
          }
        },
        "es_aggs_script": {
          "bucket_script": {
            "buckets_path": {
              "orderCount": "es_aggs_count_two",
              "couponCount": "es_aggs_count_one"
            },
            "script": {
              "source": "params.couponCount/params.orderCount",
              "lang": "painless"
            },
            "gap_policy": "skip"
          }
        },
        "sort": {
          "bucket_sort": {
            "sort": [
              {
                "es_aggs_script": {
                  "order": "desc"
                }
              }
            ],
            "from": 0,
            "size": 40,
            "gap_policy": "skip"
          }
        }
      }
    }
  }
}
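One possible way to measure how inaccurate the query above is, before tuning it, is to switch `show_term_doc_count_error` to true and read the `doc_count_error_upper_bound` values from the response. The sketch below only builds a trimmed request body and parses a response dictionary; it does not talk to a cluster, and the helper name `max_doc_count_error` and the sample response are hypothetical.

```python
import json

# Trimmed variant of the terms aggregation above, with error reporting enabled.
terms_agg = {
    "field": "mallId",
    "size": 20,
    "shard_size": 10000,
    "show_term_doc_count_error": True,
    "order": [{"_count": "desc"}, {"_key": "asc"}],
}
body = {"size": 0, "aggs": {"es_aggs_bucketing": {"terms": terms_agg}}}
print(json.dumps(body, indent=2))

def max_doc_count_error(response, agg_name="es_aggs_bucketing"):
    """Return the largest doc_count_error_upper_bound found in a terms response."""
    agg = response["aggregations"][agg_name]
    per_bucket = [b.get("doc_count_error_upper_bound", 0) for b in agg["buckets"]]
    return max([agg.get("doc_count_error_upper_bound", 0), *per_bucket])

# Hypothetical response fragment, shaped like a terms aggregation result.
sample_response = {
    "aggregations": {
        "es_aggs_bucketing": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 12,
            "buckets": [
                {"key": "mall-1", "doc_count": 7, "doc_count_error_upper_bound": 0},
                {"key": "mall-2", "doc_count": 6, "doc_count_error_upper_bound": 2},
            ],
        }
    }
}
worst = max_doc_count_error(sample_response)
```

If the reported bound is 0 for every bucket, the counts for the returned buckets are exact for that request, and a smaller `shard_size` may be worth trying for performance.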

1 Answer:

Answer 0 (score: 0)

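My dataset is not very large (it may reach 150GB within a year), so I am trying to:

  • use 30 shards to hold the mall-level records, and
  • bind routing to mallId, so that all records of a mall land on the same shard and every mall-level ratio is accurate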
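The idea behind this approach can be sketched in Python as follows. This is a toy model under stated assumptions: `shard_for` uses CRC32 as a stand-in hash (not Elasticsearch's actual routing function), and the records are made up. In a real index the equivalent step would be passing the mallId as the routing value when indexing each document.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 30

def shard_for(mall_id):
    """Stand-in routing function: the same mallId always maps to the same shard."""
    return zlib.crc32(mall_id.encode("utf-8")) % NUM_SHARDS

# Hypothetical daily records: (mallId, cpnTotalCount, orderTotalCount)
records = [
    ("mall-1", 3, 10), ("mall-1", 5, 10),
    ("mall-2", 4, 5), ("mall-2", 4, 15),
    ("mall-3", 9, 10),
]

# shard id -> mallId -> [sum of cpnTotalCount, sum of orderTotalCount]
shards = defaultdict(lambda: defaultdict(lambda: [0, 0]))
for mall_id, cpn, orders in records:
    sums = shards[shard_for(mall_id)][mall_id]
    sums[0] += cpn
    sums[1] += orders

# With routing, each mall appears on exactly one shard.
shards_per_mall = defaultdict(set)
for shard_id, malls in shards.items():
    for mall_id in malls:
        shards_per_mall[mall_id].add(shard_id)

# So the ratio computed on that one shard is already the complete ratio.
ratios = {
    mall_id: cpn / orders
    for malls in shards.values()
    for mall_id, (cpn, orders) in malls.items()
}
top = sorted(ratios.items(), key=lambda kv: -kv[1])
print(top)
```

Because no mall is split across shards, a shard never reports a partial sum for a mall, which is what makes the per-mall ratio exact in this model. The trade-off is that shard sizes depend on how evenly records are spread across malls.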