Elasticsearch: get the top groups by sum with additional filters (Elasticsearch version 5.3)

Time: 2017-08-11 11:26:53

Tags: elasticsearch querydsl

This is my mapping:

{
    "settings" : {
        "number_of_shards" : 2,
        "number_of_replicas" : 1
    },
    "mappings" :{
        "cpt_logs_mapping" : {
            "properties" : {
                "channel_id" : {"type":"integer","store":"yes","index":"not_analyzed"},
                "playing_date" : {"type":"string","store":"yes","index":"not_analyzed"},
                "country_code" : {"type":"text","store":"yes","index":"analyzed"},
                "playtime_in_sec" : {"type":"integer","store":"yes","index":"not_analyzed"},
                "channel_name" : {"type":"text","store":"yes","index":"analyzed"},
                "device_report_tag" : {"type":"text","store":"yes","index":"analyzed"}
            }
        }
    }
}

I want to query the index in a way similar to the following MySQL query:

SELECT 
  channel_name,
  SUM(`playtime_in_sec`) as playtime_in_sec 
FROM
  channel_play_times_bar_chart
WHERE
country_code = 'country' AND 
device_report_tag = 'device' AND
channel_name = 'channel' AND
playing_date BETWEEN 'date_range_start' AND 'date_range_end' 
GROUP BY channel_id
ORDER BY SUM(`playtime_in_sec`) DESC
LIMIT 30;

So far, my Query DSL looks like this:

{
  "size": 0,
  "aggs": {
    "ch_agg": {
      "terms": {
        "field": "channel_id",
        "size": 30,
        "order": {
          "sum_agg": "desc"
        }
      },
      "aggs": {
        "sum_agg": {
          "sum": {
            "field": "playtime_in_sec"
          }
        }
      }
    }
  }
}

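Question 1: The Query DSL above does give me the top 30 channel_ids by playtime, but I am confused about how to add the other filters to the search, i.e. country_code, device_report_tag and playing_date.

Question 2: The result set contains only the channel_id and the playtime, unlike the MySQL result set, which returns the channel_name and playtime_in_sec columns. In other words, I want to aggregate on the channel_id field, but the result set should return the corresponding channel_name for each group.

NOTE: Performance is the top priority here, because this will run behind a graph generator that queries millions of documents or more.

Test data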

hits: [
    {
        _index: "cpt_logs_index",
        _type: "cpt_logs_mapping",
        _id: "",
        _score: 1,
        _source: {
            ChID: 1453,
            playtime_in_sec: 35,
            device_report_tag: "mydev",
            channel_report_tag: "Sony Six",
            country_code: "SE",
            @timestamp: "2017-08-11",
        }
    },
    {
        _index: "cpt_logs_index",
        _type: "cpt_logs_mapping",
        _id: "",
        _score: 1,
        _source: {
            ChID: 145,
            playtime_in_sec: 25,
            device_report_tag: "mydev",
            channel_report_tag: "Star Movies",
            country_code: "US",
            @timestamp: "2017-08-11",
        }
    },
    {
        _index: "cpt_logs_index",
        _type: "cpt_logs_mapping",
        _id: "",
        _score: 1,
        _source: {
            ChID: 12,
            playtime_in_sec: 15,
            device_report_tag: "mydev",
            channel_report_tag: "HBO",
            country_code: "PK",
            @timestamp: "2017-08-12",
        }
    }
]

1 Answer:

Answer 0 (score: 0)

Question 1:

Do you want to add a filter/query to the example above? If so, you just add a "query" node to the query document:

{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "terms": { "country_code": ["pk", "us", "se"] } },
        { "range": { "@timestamp": { "gt": "2017-01-01", "lte": "2017-08-11" } } }
      ]
    }
  },
  "aggs": {
    "ch_agg": {
      "terms": {
        "field": "ChID",
        "size": 30
      },
      "aggs": {
        "ch_report_tag_agg": {
          "terms": {
            "field": "channel_report_tag.keyword"
          },
          "aggs": {
            "sum_agg": {
              "sum": {
                "field": "playtime_in_sec"
              }
            }
          }
        }
      }
    }
  }
}

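You can pre-filter the search with any of Elasticsearch's regular queries/filters before aggregating. (Regarding performance: Elasticsearch applies the query and filters before it starts aggregating, so any filtering you can do here helps a lot.)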
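As a rough sketch of how the remaining WHERE conditions and the ORDER BY from the MySQL query could be expressed, the request below puts every condition in the bool query's "filter" clause (which skips scoring, something a "size": 0 request does not need) and orders the buckets by the sum. Several things here are assumptions rather than facts from the question: the field names are taken from the test data (ChID, @timestamp) instead of the mapping, the term values are lowercased on the assumption that the text fields are analyzed with the standard analyzer, @timestamp is assumed to be mapped as a date, and the concrete values are placeholders.

```json
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "country_code": "se" } },
        { "term": { "device_report_tag": "mydev" } },
        { "range": { "@timestamp": { "gte": "2017-08-01", "lte": "2017-08-11" } } }
      ]
    }
  },
  "aggs": {
    "ch_agg": {
      "terms": {
        "field": "ChID",
        "size": 30,
        "order": { "sum_agg": "desc" }
      },
      "aggs": {
        "sum_agg": { "sum": { "field": "playtime_in_sec" } }
      }
    }
  }
}
```

If the fields are indexed differently (for example as keyword sub-fields), the term values would need to match the stored form exactly.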

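Question 2:

Off the top of my head, I would suggest one of two solutions (unless I have completely misunderstood the question):

  1. Add aggs levels for the fields you want in the output, in the order you want to drill down. (You can nest aggs within aggs quite deeply without problems, and you get the bonus of a count at each level.)

  2. Use a top_hits aggregation at the "lowest" level of the aggs, and specify the fields you want in the output with "_source": {"include": [/* fields */]}.

Can you provide a few records of test data?

Also, it is very useful to know which version of Elasticsearch you are running, since syntax and behavior change a lot between major versions.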
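The second option could look roughly like the sketch below: the buckets are still built on the channel id and ordered by the sum, and a top_hits sub-aggregation of size 1 pulls the channel name from one document in each bucket. This assumes the field names from the test data (ChID, channel_report_tag) and that every ChID always carries the same channel name; the aggregation name channel_name_hit is made up for illustration, and the "query" part is left out for brevity. It uses the "includes" spelling of the source filter.

```json
{
  "size": 0,
  "aggs": {
    "ch_agg": {
      "terms": {
        "field": "ChID",
        "size": 30,
        "order": { "sum_agg": "desc" }
      },
      "aggs": {
        "sum_agg": { "sum": { "field": "playtime_in_sec" } },
        "channel_name_hit": {
          "top_hits": {
            "size": 1,
            "_source": { "includes": ["channel_report_tag"] }
          }
        }
      }
    }
  }
}
```

Compared with nesting a second terms aggregation, this keeps one bucket level per channel, at the cost of fetching one document per bucket.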