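I am looking for a way to build a cross-venue visitor report for a client. He wants an HTTP API that returns the total number of unique customers who visited more than one venue within a range of days (the API must respond within 1-2 seconds).
Sample of the raw data (in reality there are millions of records):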
--------------------------
DAY | CUSTOMER | VENUE
--------------------------
1 | cust_1 | A
2 | cust_2 | A
3 | cust_1 | B
3 | cust_2 | A
4 | cust_1 | C
5 | cust_3 | C
6 | cust_3 | A
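Now I want to compute the cross-visitor report. IMO the steps are as follows:
Step 1: aggregate the raw data from day 1 to day 6
--------------------------
CUSTOMER | VENUES VISITED
--------------------------
cust_1 | [A, B, C]
cust_2 | [A]
cust_3 | [A, C]
Step 2: derive the final result
Total unique cross-customers: 2 (cust_1 and cust_3)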
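The two steps above can be sketched in plain Python to pin down the expected result. This is only an illustration of the logic on the sample rows, not something that would scale to millions of records; the function name `cross_visitors` and the in-memory `ROWS` list are assumptions made for the example.

```python
from collections import defaultdict

# Sample rows from the question: (day, customer, venue).
ROWS = [
    (1, "cust_1", "A"),
    (2, "cust_2", "A"),
    (3, "cust_1", "B"),
    (3, "cust_2", "A"),
    (4, "cust_1", "C"),
    (5, "cust_3", "C"),
    (6, "cust_3", "A"),
]


def cross_visitors(rows, first_day, last_day):
    """Return the customers who visited more than one venue in [first_day, last_day]."""
    venues_by_customer = defaultdict(set)
    for day, customer, venue in rows:
        if first_day <= day <= last_day:
            # Step 1: collect the distinct venues per customer.
            venues_by_customer[customer].add(venue)
    # Step 2: keep only customers with at least two distinct venues.
    return {c for c, venues in venues_by_customer.items() if len(venues) > 1}


result = cross_visitors(ROWS, 1, 6)
print(len(result), sorted(result))
```

For days 1 to 6 this yields the two customers listed in step 2; narrowing the range to days 1 to 3 leaves only cust_1.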
I have already tried a few solutions:
Please give me a clue as to how to solve my problem.
Thanks in advance!
Answer 0 (score: 0)
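You can do this easily with Elasticsearch by using a date_histogram aggregation to bucket by day, two nested terms aggregations (first on customer, then on venue), and then a bucket_selector pipeline aggregation to keep only the customers who visited more than one venue on any given day. It looks like this: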
POST /sales/_search
{
  "size": 0,
  "aggs": {
    "by_day": {
      "date_histogram": {
        "field": "date",
        "interval": "day"
      },
      "aggs": {
        "customers": {
          "terms": {
            "field": "customer.keyword"
          },
          "aggs": {
            "venues": {
              "terms": {
                "field": "venue.keyword"
              }
            },
            "cross_selector": {
              "bucket_selector": {
                "buckets_path": {
                  "venues_count": "venues._bucket_count"
                },
                "script": {
                  "source": "params.venues_count > 1"
                }
              }
            }
          }
        }
      }
    }
  }
}
In the result set, you will get customers 1 and 3, as expected.
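To read the surviving customers out of that response on the client side, something like the following Python sketch could be used. It assumes the usual nested-bucket layout of an aggregation response (`aggregations.by_day.buckets[].customers.buckets[].key`); the helper name `customers_per_day` and the `sample` fragment are hypothetical and hand-written for illustration, not real output.

```python
def customers_per_day(response):
    """Map each day bucket to the customer keys that passed the bucket_selector."""
    days = response["aggregations"]["by_day"]["buckets"]
    return {
        # Prefer the formatted date if present, else the epoch-millis key.
        day.get("key_as_string", day["key"]): [
            customer["key"] for customer in day["customers"]["buckets"]
        ]
        for day in days
    }


# Hypothetical response fragment, written by hand for this example.
sample = {
    "aggregations": {
        "by_day": {
            "buckets": [
                {
                    "key_as_string": "2018-01-01",
                    "key": 1514764800000,
                    "doc_count": 3,
                    "customers": {
                        "buckets": [
                            {
                                "key": "cust_9",
                                "doc_count": 2,
                                "venues": {
                                    "buckets": [
                                        {"key": "A", "doc_count": 1},
                                        {"key": "B", "doc_count": 1},
                                    ]
                                },
                            }
                        ]
                    },
                },
                {
                    "key_as_string": "2018-01-02",
                    "key": 1514851200000,
                    "doc_count": 1,
                    "customers": {"buckets": []},
                },
            ]
        }
    }
}

print(customers_per_day(sample))
```

Days on which no customer visited more than one venue come back with an empty customer list, so the caller still has to union the per-day lists to get a single count.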
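Update:
Another approach is to use the scripted_metric aggregation and implement the logic yourself. It is a bit more complex and, depending on the number of documents and your hardware, may not perform well, but the following algorithm returns exactly the value 2 that you expect: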
POST sales/_search
{
  "size": 0,
  "aggs": {
    "unique": {
      "scripted_metric": {
        "init_script": "params._agg.visits = new HashMap()",
        "map_script": "def cust = doc['customer.keyword'].value; def venue = doc['venue.keyword'].value; def venues = params._agg.visits.get(cust); if (venues == null) { venues = new HashSet(); } venues.add(venue); params._agg.visits.put(cust, venues)",
        "combine_script": "def merged = new HashMap(); for (v in params._agg.visits.entrySet()) { def cust = merged.get(v.key); if (cust == null) { merged.put(v.key, v.value) } else { cust.addAll(v.value); } } return merged",
        "reduce_script": "def merged = new HashMap(); for (agg in params._aggs) { for (v in agg.entrySet()) {def cust = merged.get(v.key); if (cust == null) {merged.put(v.key, v.value)} else {cust.addAll(v.value); }}} def unique = 0; for (m in merged.entrySet()) { if (m.value.size() > 1) unique++;} return unique"
      }
    }
  }
}
Response:
{
  "took": 1413,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 7,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "unique": {
      "value": 2
    }
  }
}
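To make the map and reduce phases of the scripts easier to follow, here is a rough Python model of the same logic. It is not Painless and does not talk to Elasticsearch; the function names and the split of the seven sample documents across two shards are assumptions made up for the illustration.

```python
def map_shard(docs):
    """Model of init_script + map_script: customer -> set of venues on one shard."""
    visits = {}
    for doc in docs:
        visits.setdefault(doc["customer"], set()).add(doc["venue"])
    return visits


def reduce_shards(shard_states):
    """Model of reduce_script: union the per-shard sets, then count cross-visitors."""
    merged = {}
    for state in shard_states:
        for customer, venues in state.items():
            merged.setdefault(customer, set()).update(venues)
    return sum(1 for venues in merged.values() if len(venues) > 1)


# Hypothetical placement of the question's seven documents on two shards.
shard_a = [
    {"customer": "cust_1", "venue": "A"},
    {"customer": "cust_2", "venue": "A"},
    {"customer": "cust_2", "venue": "A"},
    {"customer": "cust_3", "venue": "C"},
]
shard_b = [
    {"customer": "cust_1", "venue": "B"},
    {"customer": "cust_1", "venue": "C"},
    {"customer": "cust_3", "venue": "A"},
]

unique = reduce_shards([map_shard(shard_a), map_shard(shard_b)])
print(unique)
```

In this split cust_3 has a single venue on each shard, so the customer only counts as a cross-visitor once the per-shard sets are unioned. That is why the count has to happen in the reduce phase and why every shard must ship its full customer map to the coordinating node, which is the main cost of this approach.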