Self-join problem in BigQuery

Date: 2019-12-06 16:28:01

Tags: google-bigquery

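I am trying to sum values over a specific date range around the current row. Since BigQuery does not support date ranges in window functions, I am using a self-join, as shown below: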

with test_data as (
   select 1 val1, 7 val2, 'ord001' id, timestamp('2019-01-01 04:00:00') dt_order
   union all
   select 2 val1, 14 val2, 'ord002' id, timestamp('2019-01-02 05:00:00') dt_order
   union all
   select 3 val1, 21 val2, 'ord003' id, timestamp('2019-01-03 06:00:00') dt_order
)

,revenue_coeff as (
select 
   td.id,
   td.val1 *
   (select sum(td1.val2) / sum(td1.val1)
    from test_data td1
    where td1.dt_order >= timestamp_sub(td.dt_order, interval 24 hour) and
       td1.dt_order < timestamp_add(td.dt_order, interval 6 minute)
    )
from test_data td
)

select * from revenue_coeff

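This toy query runs fine. However, when I run it against a real BigQuery table, I get the error "LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join." How can I implement such a query in BQ? Thanks in advance!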

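2 answers:

Answer 0 (score: 2)

The following is for BigQuery Standard SQL.

I will first answer the question at the end of your post, and then address the statement at the top of it. So...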

  

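I get the error "LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join." How can I implement such a query in BQ?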

#standardSQL
WITH `project.dataset.test_data` AS (
   SELECT 1 val1,  7 val2, 'ord001' id, TIMESTAMP('2019-01-01 04:00:00') dt_order UNION ALL
   SELECT 1 val1, 14 val2, 'ord002' id, TIMESTAMP('2019-01-02 05:00:00') dt_order UNION ALL
   SELECT 1 val1, 21 val2, 'ord003' id, TIMESTAMP('2019-01-03 06:00:00') dt_order
), revenue_coeff AS (
  SELECT 
    td1.id, 
    td1.val1  * SUM(td2.val2) / SUM(td2.val1)
  FROM `project.dataset.test_data` td1
  CROSS JOIN `project.dataset.test_data` td2
  WHERE td2.dt_order >= TIMESTAMP_SUB(td1.dt_order, INTERVAL 24 HOUR) 
  AND   td2.dt_order <  TIMESTAMP_ADD(td1.dt_order, INTERVAL 6 MINUTE)
  GROUP BY td1.id, td1.val1  
)
SELECT * FROM revenue_coeff   

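As you can see, instead of a LEFT JOIN you can use a CROSS JOIN and move the condition from the ON clause into the WHERE clause. Each row's window includes the row itself, so as long as dt_order is not NULL every row of td1 keeps at least one match and no rows are dropped.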

  

Since BigQuery does not support date ranges in window functions ...

Actually, it does. See this example:

#standardSQL
WITH `project.dataset.test_data` AS (
   SELECT 1 val1,  7 val2, 'ord001' id, TIMESTAMP('2019-01-01 04:00:00') dt_order UNION ALL
   SELECT 1 val1, 14 val2, 'ord002' id, TIMESTAMP('2019-01-02 05:00:00') dt_order UNION ALL
   SELECT 1 val1, 21 val2, 'ord003' id, TIMESTAMP('2019-01-03 06:00:00') dt_order
), revenue_coeff AS (
  SELECT id, val1  * SUM(val2) OVER(win) / SUM(val1) OVER(win)
  FROM `project.dataset.test_data` td1
  WINDOW win AS (ORDER BY UNIX_SECONDS(dt_order) RANGE BETWEEN 86400 PRECEDING AND 359 FOLLOWING )
)
SELECT * FROM revenue_coeff   

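As you can see, the trick is to use the UNIX_SECONDS function to "convert" the TIMESTAMP data type to an integer, so that the RANGE bounds can be given in seconds: 86400 PRECEDING is 24 hours, and 359 FOLLOWING is one second short of 6 minutes, which matches the strict "<" in your original condition.

Obviously, I recommend going with the second option.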
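
One caveat with the seconds-based window: UNIX_SECONDS truncates sub-second precision, so if dt_order carries fractional seconds, the 359-second bound may not match the original "less than 6 minutes" condition exactly. Below is a minimal sketch of the same window at microsecond resolution using UNIX_MICROS. It assumes the sample data from the question. The CTE name and the revenue alias are illustrative only and not part of the original answer.

```sql
#standardSQL
WITH test_data AS (
  SELECT 1 val1,  7 val2, 'ord001' id, TIMESTAMP('2019-01-01 04:00:00') dt_order UNION ALL
  SELECT 2 val1, 14 val2, 'ord002' id, TIMESTAMP('2019-01-02 05:00:00') dt_order UNION ALL
  SELECT 3 val1, 21 val2, 'ord003' id, TIMESTAMP('2019-01-03 06:00:00') dt_order
)
SELECT
  id,
  -- "revenue" is an illustrative alias, not a name from the original post
  val1 * SUM(val2) OVER win / SUM(val1) OVER win AS revenue
FROM test_data
WINDOW win AS (
  ORDER BY UNIX_MICROS(dt_order)
  -- 24 hours = 86,400,000,000 microseconds back (inclusive);
  -- 6 minutes = 360,000,000 microseconds forward, minus 1 to keep the upper bound exclusive
  RANGE BETWEEN 86400000000 PRECEDING AND 359999999 FOLLOWING
)
```

With this sample data the orders are more than 24 hours apart, so each window contains only the row itself and the result is simply val2 for every row (7, 14 and 21).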

Answer 1 (score: 0)

You can also do a left outer join, for example:

select a.val1, a.id, 
   -- sum of val2 over the rows of b that fall inside the window around a.dt_order
   sum(if(b.dt_order >= timestamp_sub(a.dt_order, interval 24 hour) and b.dt_order < timestamp_add(a.dt_order, interval 6 minute), b.val2, 0.0))
   /
   -- sum of val1 over the same window (the denominator used in the question)
   sum(if(b.dt_order >= timestamp_sub(a.dt_order, interval 24 hour) and b.dt_order < timestamp_add(a.dt_order, interval 6 minute), b.val1, 0.0)) 
from test_data a
left join test_data b on 1=1
group by 1,2

However, you will have to handle division-by-zero errors, either upstream or by adding a CASE statement to the query.
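
One way to guard against division by zero without a CASE statement is BigQuery's SAFE_DIVIDE function, which returns NULL when the divisor is 0. The sketch below is not the answerer's query: it swaps in the CROSS JOIN plus WHERE form, and the row 'ord004' with val1 = 0 is a hypothetical addition to the sample data, included only to trigger the zero-divisor case. The revenue alias is likewise illustrative.

```sql
#standardSQL
WITH test_data AS (
  SELECT 1 val1,  7 val2, 'ord001' id, TIMESTAMP('2019-01-01 04:00:00') dt_order UNION ALL
  SELECT 2 val1, 14 val2, 'ord002' id, TIMESTAMP('2019-01-02 05:00:00') dt_order UNION ALL
  SELECT 3 val1, 21 val2, 'ord003' id, TIMESTAMP('2019-01-03 06:00:00') dt_order UNION ALL
  -- hypothetical row (not in the question): val1 = 0 makes SUM(b.val1) zero for its window
  SELECT 0 val1,  5 val2, 'ord004' id, TIMESTAMP('2019-01-05 07:00:00') dt_order
)
SELECT
  a.id,
  -- SAFE_DIVIDE yields NULL instead of raising an error when SUM(b.val1) = 0
  a.val1 * SAFE_DIVIDE(SUM(b.val2), SUM(b.val1)) AS revenue
FROM test_data a
CROSS JOIN test_data b
WHERE b.dt_order >= TIMESTAMP_SUB(a.dt_order, INTERVAL 24 HOUR)
  AND b.dt_order <  TIMESTAMP_ADD(a.dt_order, INTERVAL 6 MINUTE)
GROUP BY a.id, a.val1
```

For 'ord004' the window contains only the row itself, so the divisor is 0 and revenue comes back as NULL rather than failing the whole query. Whether NULL is the right value for such rows depends on how the result is used downstream.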