在BigQuery中,当date 我的解决方法是: 输出: 所以我的问题是,可以用更有效的方法来计算吗?
还是对于大型数据集可以采用这种解决方案?+------+------------+------------------+
| Name | date | order_id | value |
+------+------------+----------+-------+
| JONES| 2019-01-03 | 11 | 10 |
| JONES| 2019-01-05 | 12 | 5 |
| JONES| 2019-06-03 | 13 | 3 |
| JONES| 2019-07-03 | 14 | 20 |
| John | 2019-07-23 | 15 | 10 |
+------+------------+----------+-------+
WITH data AS (
SELECT "JONES" name, DATE("2019-01-03") date_time, 11 order_id, 10 value
UNION ALL
SELECT "JONES", DATE("2019-01-05"), 12, 5
UNION ALL
SELECT "JONES", DATE("2019-06-03"), 13, 3
UNION ALL
SELECT "JONES", DATE("2019-07-03"), 14, 20
UNION ALL
SELECT "John", DATE("2019-07-23"), 15, 10
),
data2 AS (
SELECT *, MIN(date_time) OVER (PARTITION BY name) min_date
FROM data
)
SELECT name,
ARRAY_AGG(STRUCT(order_id as f_id, date_time as f_date) ORDER BY order_id LIMIT 1)[OFFSET(0)].*,
sum(case when date_time< date_add(min_date,interval 3 day) then value end) as total_value_day3,
SUM(value) AS total
FROM data2
GROUP BY name
+------+------+------------+----------------+------+
| name | f_id | f_date |total_value_day3| total|
+------+------+------------+----------------+------+
| JONES| 11 | 2019-01-03 | 15 | 38 |
| John | 15 | 2019-07-23 | 10 | 10 |
+------+------+------------+----------------+------+
答案 0 :(得分:0)
在不使用窗口函数或数组聚合的情况下,以下内容将获得相同的结果,因此BQ只需执行较少的排序/分区。对于这个小例子,我的查询需要更长的时间才能运行,但是字节改组更少。如果针对更大的数据集运行此方法,我认为我的效率会更高。
WITH data AS (
SELECT "JONES" name, DATE("2019-01-03") date_time, "11" order_id, 10 value UNION ALL
SELECT "JONES", DATE("2019-01-05"), "12", 5 UNION ALL
SELECT "JONES", DATE("2019-06-03"), "13", 3 UNION ALL
SELECT "JONES", DATE("2019-07-03"), "14", 20 UNION ALL
SELECT "John", DATE("2019-07-23"), "15", 10
),
aggs as (
select name, min(date_time) as first_order_date, min(order_id) as first_order_id, sum(value) as total
from data
group by 1
)
select
name,
first_order_id as f_id,
first_order_date as f_date,
sum(value) as total_value_day3,
total
from aggs
inner join data using(name)
where date_time < date_add(first_order_date, interval 3 day) -- <= perhaps
group by 1,2,3,5
请注意,这是假设order_id是连续的(aka order_id 11总是出现在order_id 12之前),其方式与日期是连续的。