BigQuery: counting rows whose date range covers each date

Time: 2020-04-08 05:23:14

Tags: arrays date google-bigquery intervals

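I ended up solving this by downloading all the data and iterating over it in Python, but I would like to know whether there is a way to do this in BigQuery.

We have a table with begin and end dates: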

begin_date, end_date
'2016-02-19', '2016-02-19'
'2016-02-20', '2016-02-25'
'2016-02-21', '2016-02-25'
'2016-02-22', NULL

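For each date, we want the number of rows where begin_date <= date <= end_date (a NULL end_date counts as still open). For any particular value, selecting the count is easy: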

SELECT COUNT(*) FROM `table` WHERE begin_date <= '2016-12-19' AND (end_date >= '2016-12-19' OR end_date IS NULL)

So if I ran this manually for every value of interest, the desired output would look like this:

begin_date, count
2016-02-19, 1
2016-02-20, 1
2016-02-21, 2
2016-02-22, 3
2016-02-23, 3
2016-02-24, 3
2016-02-25, 3
2016-02-26, 1
etc.
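
As a reference point, here is a minimal Python sketch of the per-date count described above, run against the sample rows from the question. The names `ROWS`, `count_active`, and `daily_counts` are made up for illustration, and `None` stands in for a NULL end_date. It is the brute-force version (every date is checked against every row), useful mainly for validating a faster query on small data.

```python
from datetime import date, timedelta

# Sample rows from the question: (begin_date, end_date); None stands in for NULL.
ROWS = [
    (date(2016, 2, 19), date(2016, 2, 19)),
    (date(2016, 2, 20), date(2016, 2, 25)),
    (date(2016, 2, 21), date(2016, 2, 25)),
    (date(2016, 2, 22), None),
]


def count_active(rows, day):
    """Number of rows with begin_date <= day and (end_date >= day or end_date is NULL)."""
    return sum(
        1 for begin, end in rows
        if begin <= day and (end is None or end >= day)
    )


def daily_counts(rows, first_day, last_day):
    """Return [(day, count), ...] for every day in [first_day, last_day]."""
    n_days = (last_day - first_day).days + 1
    days = [first_day + timedelta(days=i) for i in range(n_days)]
    return [(d, count_active(rows, d)) for d in days]


if __name__ == "__main__":
    for day, cnt in daily_counts(ROWS, date(2016, 2, 19), date(2016, 2, 26)):
        print(day.isoformat(), cnt)
```

For the sample rows this reproduces the counts listed above, including the drop to 1 on 2016-02-26, when only the open-ended row is still active.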

Creating the list of dates to iterate over is easy:

WITH dates AS (SELECT * FROM UNNEST(GENERATE_DATE_ARRAY('2018-10-01', '2020-09-30', INTERVAL 1 DAY)) AS example)

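Now I am struggling to apply the WHERE clause above across all of these dates. I can see how range-based partitioning works when matching against a single column (as in a linked example), but I need to match against both begin_date and end_date.

I thought I could do this: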

SELECT
  status_begin_date,
  (SELECT COUNT(1) FROM UNNEST(ends) AS e WHERE (e >= status_begin_date OR e IS NULL)) AS cnt
FROM (
  SELECT
    status_begin_date,
    ARRAY_AGG(status_end_date) OVER(ORDER BY status_begin_date) AS ends
  FROM `table`
)
ORDER BY status_begin_date

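This is adapted from another StackOverflow answer. It works for the small example given there, but on a table with a few hundred million rows it fails with a resources error.

Is there a scalable solution in BigQuery?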

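2 answers:

Answer 0: (score: 2)

The following is for BigQuery Standard SQL. It avoids the inefficient cursor-style, row-by-row approach and uses a classic set-based SQL approach instead: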

#standardSQL
WITH dates AS (
  SELECT day 
  FROM (SELECT MIN(begin_date) min_date, MAX(end_date) max_date FROM `table`), 
  UNNEST(GENERATE_DATE_ARRAY(min_date, CURRENT_DATE(), INTERVAL 1 DAY)) AS day
)
SELECT day, COUNT(*) 
FROM dates 
JOIN `table` 
ON begin_date <= day AND (end_date >= day OR end_date IS NULL)
GROUP BY day

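You can test and play with the above using the sample data from your question. Note that this version ends the date range at max_date rather than CURRENT_DATE():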

#standardSQL
WITH `table` AS (
  SELECT DATE '2016-02-19' begin_date, DATE '2016-02-19' end_date UNION ALL
  SELECT '2016-02-20', '2016-02-25' UNION ALL
  SELECT '2016-02-21', '2016-02-25' UNION ALL
  SELECT '2016-02-22', NULL
), dates AS (
  SELECT day 
  FROM (SELECT MIN(begin_date) min_date, MAX(end_date) max_date FROM `table`), 
  UNNEST(GENERATE_DATE_ARRAY(min_date, max_date, INTERVAL 1 DAY)) AS day
)
SELECT day, COUNT(*) 
FROM dates 
JOIN `table` 
ON begin_date <= day AND (end_date >= day OR end_date IS NULL)
GROUP BY day
-- ORDER BY day  

with the result:

Row day         f0_  
1   2016-02-19  1    
2   2016-02-20  1    
3   2016-02-21  2    
4   2016-02-22  3    
5   2016-02-23  3    
6   2016-02-24  3    
7   2016-02-25  3    
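
The join above compares each date against each row. As an alternative sketch (not part of this answer, and written in Python only to illustrate the idea), the same counts can be obtained with a sweep: add +1 on each begin_date, add -1 on the day after each end_date, then take a running sum over the days. The function name `daily_counts_sweep` and the `(begin, end)` tuple layout are assumptions made for this example, and `None` stands in for a NULL end_date.

```python
from collections import defaultdict
from datetime import date, timedelta
from itertools import accumulate


def daily_counts_sweep(rows, first_day, last_day):
    """Count rows active on each day in [first_day, last_day] via a running sum."""
    deltas = defaultdict(int)
    for begin, end in rows:
        # Clamp the start so rows that began before the window are still counted.
        start = max(begin, first_day)
        # Skip rows that lie entirely outside the window (or have end < begin).
        if start > last_day or (end is not None and end < start):
            continue
        deltas[start] += 1
        # An open-ended row, or one ending on or after last_day, never drops out.
        if end is not None and end < last_day:
            deltas[end + timedelta(days=1)] -= 1

    n_days = (last_day - first_day).days + 1
    days = [first_day + timedelta(days=i) for i in range(n_days)]
    running = accumulate(deltas.get(d, 0) for d in days)
    return list(zip(days, running))


if __name__ == "__main__":
    sample = [
        (date(2016, 2, 19), date(2016, 2, 19)),
        (date(2016, 2, 20), date(2016, 2, 25)),
        (date(2016, 2, 21), date(2016, 2, 25)),
        (date(2016, 2, 22), None),
    ]
    for day, cnt in daily_counts_sweep(sample, date(2016, 2, 19), date(2016, 2, 25)):
        print(day.isoformat(), cnt)
```

The work here grows with the number of rows plus the number of days rather than their product. In SQL the same idea would presumably translate to a UNION ALL of +1/-1 events followed by a windowed SUM, but that variant is not tested here.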

Answer 1: (score: 0)

This nasty code works:

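DECLARE dates ARRAY<DATE>;
DECLARE x INT64 DEFAULT 0;
-- Start from empty arrays: ARRAY_CONCAT returns NULL if any argument is NULL.
DECLARE results ARRAY<INT64> DEFAULT ARRAY<INT64>[];
DECLARE results_dates ARRAY<DATE> DEFAULT ARRAY<DATE>[];
DECLARE result INT64;
DECLARE date DATE;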
SET dates = GENERATE_DATE_ARRAY('2016-02-17', '2019-05-13', INTERVAL 1 DAY);
LOOP
  SET date = dates[OFFSET(x)];
  SET result = (SELECT COUNT(*) FROM `table` WHERE begin_date <= date AND (end_date >= date OR end_date IS NULL));
  SET results = ARRAY_CONCAT(results, [result]);
  SET results_dates = ARRAY_CONCAT(results_dates, [date]);
  SET x = x + 1;
  IF x >= ARRAY_LENGTH(dates) THEN
    LEAVE;
  END IF;
END LOOP;
SELECT date, count_subscribers
FROM UNNEST(results_dates) AS date WITH OFFSET 
JOIN UNNEST(results) AS count_subscribers WITH OFFSET
USING(OFFSET)

It ran for 1.5 hours, which is better than my Python code (7 hours), but the BigQuery script cannot be parallelized, whereas the Python code can.
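
The Python side parallelizes because each date's count is independent of every other date, so the date range can be split into chunks that are processed separately. The sketch below is only an illustration of that idea, not the code referred to above: `count_chunk`, `parallel_daily_counts`, `chunk_size`, and `workers` are hypothetical names. It uses threads to stay self-contained, and for CPU-bound pure-Python work a process pool would be the more likely choice.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def count_chunk(rows, days):
    """Brute-force count of active rows for each day in one chunk of dates."""
    return [
        (d, sum(1 for begin, end in rows if begin <= d and (end is None or end >= d)))
        for d in days
    ]


def parallel_daily_counts(rows, first_day, last_day, chunk_size=30, workers=4):
    """Split [first_day, last_day] into chunks and count each chunk concurrently."""
    n_days = (last_day - first_day).days + 1
    days = [first_day + timedelta(days=i) for i in range(n_days)]
    chunks = [days[i:i + chunk_size] for i in range(0, n_days, chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order the chunks were submitted.
        parts = pool.map(lambda chunk: count_chunk(rows, chunk), chunks)
        return [pair for part in parts for pair in part]


if __name__ == "__main__":
    sample = [
        (date(2016, 2, 19), date(2016, 2, 19)),
        (date(2016, 2, 20), date(2016, 2, 25)),
        (date(2016, 2, 21), date(2016, 2, 25)),
        (date(2016, 2, 22), None),
    ]
    result = parallel_daily_counts(sample, date(2016, 2, 19), date(2016, 2, 26), chunk_size=3)
    for day, cnt in result:
        print(day.isoformat(), cnt)
```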