Is there another way to compute data filtered by date without a WITH clause?

Asked: 2019-09-27 12:49:03

Tags: time google-bigquery

In BigQuery, when date

+------+------------+----------+-------+
| Name |    date    | order_id | value |
+------+------------+----------+-------+
| JONES| 2019-01-03 | 11       |    10 |
| JONES| 2019-01-05 | 12       |    5  |
| JONES| 2019-06-03 | 13       |    3  |
| JONES| 2019-07-03 | 14       |    20 |
| John | 2019-07-23 | 15       |    10 |
+------+------------+----------+-------+

My solution is:

 WITH data AS (
      SELECT "JONES" name, DATE("2019-01-03") date_time, 11 order_id, 10 value
      UNION ALL
      SELECT "JONES", DATE("2019-01-05"), 12, 5
      UNION ALL
      SELECT "JONES", DATE("2019-06-03"), 13, 3
      UNION ALL
      SELECT "JONES", DATE("2019-07-03"), 14, 20
      UNION ALL
      SELECT "John", DATE("2019-07-23"), 15, 10
    ),
data2 AS (
    SELECT *, MIN(date_time) OVER (PARTITION BY name) min_date
    FROM data
)    
    SELECT name,
    ARRAY_AGG(STRUCT(order_id as f_id, date_time as f_date) ORDER BY order_id LIMIT 1)[OFFSET(0)].*,
    sum(case when date_time< date_add(min_date,interval 3 day) then value  end)  as total_value_day3,
    SUM(value) AS total
    FROM data2
    GROUP BY name

Output:

+------+------+------------+----------------+------+
| name | f_id | f_date     |total_value_day3| total|
+------+------+------------+----------------+------+
| JONES| 11   | 2019-01-03 | 15             | 38   | 
| John | 15   | 2019-07-23 | 10             | 10   | 
+------+------+------------+----------------+------+

So my question is: can this be computed more efficiently? Or is this solution acceptable for large datasets?

1 Answer:

Answer 0: (score: 0)

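The following gets the same result without window functions or array aggregation, so BQ has less sorting/partitioning to do. On this small example my query takes longer to run, but it shuffles fewer bytes. Run against a much larger dataset, I think mine would be more efficient.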

 WITH data AS (
      SELECT "JONES" name, DATE("2019-01-03") date_time, "11" order_id, 10 value       UNION ALL
      SELECT "JONES", DATE("2019-01-05"), "12", 5      UNION ALL
      SELECT "JONES", DATE("2019-06-03"), "13", 3      UNION ALL
      SELECT "JONES", DATE("2019-07-03"), "14", 20     UNION ALL
      SELECT "John", DATE("2019-07-23"), "15", 10    
),
aggs as (
    select name, min(date_time) as first_order_date, min(order_id) as first_order_id, sum(value) as total
    from data
    group by 1
)    
select 
  name,
  first_order_id as f_id,
  first_order_date as f_date, 
  sum(value) as total_value_day3,
  total
from aggs
inner join data using(name)
where date_time < date_add(first_order_date, interval 3 day) -- <= perhaps
group by 1,2,3,5
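
To sanity-check the join-based query without a BigQuery project, here is a rough sketch that ports it to SQLite through Python's built-in sqlite3 module. It is not BigQuery itself: `date(x, '+3 days')` stands in for `date_add(x, interval 3 day)`, dates are stored as ISO-8601 text, `order_id` is kept as an integer as in the question, and the `run_query` helper and the final `ORDER BY` are assumptions added for illustration. It says nothing about performance, only that the logic reproduces the expected output.

```python
import sqlite3

# Sample rows from the question: (name, date_time, order_id, value).
ROWS = [
    ("JONES", "2019-01-03", 11, 10),
    ("JONES", "2019-01-05", 12, 5),
    ("JONES", "2019-06-03", 13, 3),
    ("JONES", "2019-07-03", 14, 20),
    ("John", "2019-07-23", 15, 10),
]

# SQLite port of the join-based query. ISO-8601 date strings compare
# correctly as text, so "<" behaves like a date comparison here.
QUERY = """
WITH aggs AS (
    SELECT name,
           MIN(date_time) AS first_order_date,
           MIN(order_id)  AS first_order_id,
           SUM(value)     AS total
    FROM data
    GROUP BY name
)
SELECT aggs.name,
       first_order_id   AS f_id,
       first_order_date AS f_date,
       SUM(data.value)  AS total_value_day3,
       total
FROM aggs
INNER JOIN data ON data.name = aggs.name
WHERE data.date_time < date(first_order_date, '+3 days')
GROUP BY aggs.name, first_order_id, first_order_date, total
ORDER BY f_id
"""


def run_query(rows):
    """Load rows into an in-memory table and return the query result."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data "
        "(name TEXT, date_time TEXT, order_id INTEGER, value INTEGER)"
    )
    conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


result = run_query(ROWS)
for row in result:
    print(row)
```

The two rows printed match the output table in the question (15 and 38 for JONES, 10 and 10 for John).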

Note that this assumes order_id is sequential (i.e. order_id 11 always occurs before order_id 12), in the same way that dates are sequential.
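
If that assumption might not hold, the first order has to be picked by date rather than by the smallest id. The sketch below shows the difference on made-up rows where the ids are out of date order; it again uses SQLite via Python's sqlite3 as a stand-in, and the rows, the `names` CTE and the `compare` helper are hypothetical. In BigQuery, one possible equivalent would be ordering the question's `ARRAY_AGG(... LIMIT 1)` by `date_time` instead of `order_id`.

```python
import sqlite3

# Hypothetical rows where order_id does NOT follow date order:
# order 20 is the earliest order, order 7 was placed two days later.
ROWS = [
    ("JONES", "2019-01-03", 20, 10),
    ("JONES", "2019-01-05", 7, 5),
]

# min_order_id   : what MIN(order_id) returns (the answer's approach).
# first_order_id : the id of the row with the earliest date, with ties
#                  on date broken by the smaller order_id.
QUERY = """
WITH names AS (
    SELECT name, MIN(order_id) AS min_order_id
    FROM data
    GROUP BY name
)
SELECT name,
       min_order_id,
       (SELECT order_id
        FROM data
        WHERE data.name = names.name
        ORDER BY date_time, order_id
        LIMIT 1) AS first_order_id
FROM names
"""


def compare(rows):
    """Return (name, min_order_id, first_order_id) for each name."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data "
        "(name TEXT, date_time TEXT, order_id INTEGER, value INTEGER)"
    )
    conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


comparison = compare(ROWS)
print(comparison)
```

Here the two columns disagree (7 versus 20), which is exactly the case the note above warns about. A related caveat: the answer's sample data stores order_id as strings, and MIN on strings compares them as text, so an id such as "9" would sort after "11".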