Get the most recent known record for each month in BigQuery

Time: 2019-11-07 15:20:39

Tags: google-bigquery

I have an account balance collection that shows each customer's account balance on specific dates:

+---------------+---------+------------+
|  customer_id  |  value  | timestamp  |
+---------------+---------+------------+
| 1             |  -500   | 2019-10-12 |
| 1             |  -300   | 2019-10-11 |
| 1             |  -200   | 2019-10-10 |
| 1             |  0      | 2019-10-09 |
| 2             |  200    | 2019-09-10 |
| 1             |  600    | 2019-09-02 |
+---------------+---------+------------+

Note that customer 2 has no account balance update in October.

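I want to get the last account balance for each customer for each month. If a customer has no balance update in a given month, the most recent known balance should be carried forward into that month. The result should look like this: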

+---------------+---------+------------+
|  customer_id  |  value  | timestamp  |
+---------------+---------+------------+
| 1             |  -500   | 2019-10-12 |
| 2             |  200    | 2019-10-10 |
| 2             |  200    | 2019-09-10 |
| 1             |  600    | 2019-09-02 |
+---------------+---------+------------+

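Because customer 2's balance was updated in September but not in October, we create a copy of the September row with the date changed to October. Any ideas on how to do this in BigQuery?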

2 answers:

Answer 0: (score: 1)

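The query below should mostly answer your question. It creates a "month end" record for each customer for every month and picks up the most recent balance as of that date: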

with 

-- Generate a set of months
month_begins as (
  select dt from unnest(generate_date_array('2019-01-01','2019-12-01', interval 1 month)) dt
),

-- Get the month ends
month_ends as (
  select date_sub(date_add(dt, interval 1 month), interval 1 day) as month_end_date from month_begins
),

--  Cross Join and group so we get 1 customer record for every month to account for 
--  situations where customer doesn't change balance in a month
user_month_ends as (
  select
    customer_id,
    month_end_date
  from `project.dataset.table`
  cross join month_ends
  group by 1,2
),

--  Fan out so for each month end, you get all balances prior to month end for each customer
values_prior_to_month_end as (
  select
    customer_id,
    value,
    timestamp,
    month_end_date
  from `project.dataset.table`
  inner join user_month_ends using(customer_id)
  where timestamp <= month_end_date
),

-- Rank balances newest first within each month end, even if the latest one is more than a month old
ordered as (
  select
    *,
    row_number() over (partition by customer_id, month_end_date order by timestamp desc) as my_row
  from values_prior_to_month_end
),

-- Finally, select only the most recent record for each customer per month
final as (
  select
    * except(my_row)
  from ordered
  where my_row = 1
)
select * from final
order by customer_id, month_end_date desc

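A few caveats:

  1. I did not order the results to match your desired result set, and I kept the month-end date column to illustrate the concept. You can easily change the ordering and exclude the fields you don't need.
  2. In the month_begins CTE, I set a range that extends into future months, so your result set will include the most recent balance for "future months" as well. To make it nicer, consider changing '2019-12-01' to current_date(), so that the query always returns results up to the end of the current month.
  3. Your timestamp field looks like a date, so I used date logic. If the underlying field is an actual timestamp, you should be able to apply the same principle with timestamp logic.
  4. In your result set, I'm not sure why the second row (customer 2) has a timestamp of '2019-10-10'. It seems arbitrary, since customer 2 has no second balance record.
  5. I intentionally split the logic into several CTEs so that I could comment each step more easily. You can certainly combine several steps into one block for a more concise query.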
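
As a rough sketch of the second caveat, the two date CTEs can be collapsed into one that stops at the current month. This assumes the range should still start at 2019-01-01, as in the query above, and it swaps the date_add/date_sub arithmetic for BigQuery's LAST_DAY function:

```sql
-- Sketch: one month-end date per month, from January 2019 through the current month.
with month_ends as (
  select last_day(dt, month) as month_end_date
  from unnest(generate_date_array(
    date '2019-01-01',                   -- assumed start of the reporting range
    date_trunc(current_date(), month),   -- first day of the current month
    interval 1 month)) as dt
)
select month_end_date
from month_ends
order by month_end_date
```

The rest of the query (user_month_ends onward) should work unchanged on top of this month_ends CTE.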

Answer 1: (score: 1)

Below is for BigQuery Standard SQL

#standardSQL
WITH customers AS (
  SELECT DISTINCT customer_id FROM `project.dataset.table`
), months AS (
  SELECT month FROM (
    SELECT DATE_TRUNC(MIN(timestamp), MONTH) min_month, DATE_TRUNC(MAX(timestamp), MONTH) max_month
    FROM `project.dataset.table`
  ), UNNEST(GENERATE_DATE_ARRAY(min_month, max_month, INTERVAL 1 MONTH)) month
)
SELECT customer_id, 
  IFNULL(value, LEAD(value) OVER(win)) value,  
  IFNULL(timestamp, DATE_ADD(LEAD(timestamp) OVER(win), INTERVAL DATE_DIFF(month, LEAD(month) OVER(win), MONTH) MONTH)) timestamp  
FROM months, customers
LEFT JOIN (
  SELECT DATE_TRUNC(timestamp, MONTH) month, customer_id, 
    ARRAY_AGG(STRUCT(value, timestamp) ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)].* 
  FROM `project.dataset.table` 
  GROUP BY month, customer_id
) USING(month, customer_id)
WINDOW win AS (PARTITION BY customer_id ORDER BY month DESC)
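
To see how the timestamp expression shifts a carried-forward date into the current month, here is a small sketch with literal dates for customer 2 plugged in (September's record on 2019-09-10, carried into October). The column aliases are only for illustration:

```sql
#standardSQL
SELECT
  -- number of whole months between the previous month and the month being filled
  DATE_DIFF(DATE '2019-10-01', DATE '2019-09-01', MONTH) AS months_between,
  -- previous month's timestamp moved forward by that many months
  DATE_ADD(DATE '2019-09-10',
    INTERVAL DATE_DIFF(DATE '2019-10-01', DATE '2019-09-01', MONTH) MONTH) AS shifted_timestamp
```

This should return 1 and 2019-10-10, which is the date the question expects for customer 2 in October.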

Applying the full query above to sample data from your question looks like the example below:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 customer_id, -500 value, DATE '2019-10-12' timestamp UNION ALL
  SELECT 1, -300, '2019-10-11' UNION ALL
  SELECT 1, -200, '2019-10-10' UNION ALL
  SELECT 2, 200, '2019-09-10' UNION ALL
  SELECT 2, 100, '2019-08-11' UNION ALL
  SELECT 2, 50, '2019-07-12' UNION ALL
  SELECT 1, 600, '2019-09-02' 
), customers AS (
  SELECT DISTINCT customer_id FROM `project.dataset.table`
), months AS (
  SELECT month FROM (
    SELECT DATE_TRUNC(MIN(timestamp), MONTH) min_month, DATE_TRUNC(MAX(timestamp), MONTH) max_month
    FROM `project.dataset.table`
  ), UNNEST(GENERATE_DATE_ARRAY(min_month, max_month, INTERVAL 1 MONTH)) month
)
SELECT customer_id, 
  IFNULL(value, LEAD(value) OVER(win)) value,  
  IFNULL(timestamp, DATE_ADD(LEAD(timestamp) OVER(win), INTERVAL DATE_DIFF(month, LEAD(month) OVER(win), MONTH) MONTH)) timestamp  
FROM months, customers
LEFT JOIN (
  SELECT DATE_TRUNC(timestamp, MONTH) month, customer_id, 
    ARRAY_AGG(STRUCT(value, timestamp) ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)].* 
  FROM `project.dataset.table` 
  GROUP BY month, customer_id
) USING(month, customer_id)
WINDOW win AS (PARTITION BY customer_id ORDER BY month DESC)
-- ORDER BY month DESC, customer_id   

The result is

Row customer_id value   timestamp    
1   1           -500    2019-10-12   
2   2           200     2019-10-10   
3   1           600     2019-09-02   
4   2           200     2019-09-10   
5   1           null    null     
6   2           100     2019-08-11   
7   1           null    null     
8   2           50      2019-07-12
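
Note that LEAD only looks one row back, so a customer with two or more consecutive months without an update would still get null for the later months. A minimal sketch of one way to carry a balance across longer gaps is LAST_VALUE with IGNORE NULLS over all earlier months. The sample rows, the latest_per_month CTE name, and the known_timestamp alias are assumptions made for illustration, and in this variant a carried-forward row keeps the timestamp of the original record instead of shifting it into the current month:

```sql
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 customer_id, 600 value, DATE '2019-09-02' timestamp UNION ALL
  SELECT 1, -500, DATE '2019-10-12' UNION ALL
  SELECT 2, 200, DATE '2019-08-10'   -- hypothetical: no update in September or October
), customers AS (
  SELECT DISTINCT customer_id FROM `project.dataset.table`
), months AS (
  SELECT month FROM (
    SELECT DATE_TRUNC(MIN(timestamp), MONTH) min_month, DATE_TRUNC(MAX(timestamp), MONTH) max_month
    FROM `project.dataset.table`
  ), UNNEST(GENERATE_DATE_ARRAY(min_month, max_month, INTERVAL 1 MONTH)) month
), latest_per_month AS (
  -- last record of each customer within each month that has data
  SELECT DATE_TRUNC(timestamp, MONTH) month, customer_id,
    ARRAY_AGG(STRUCT(value, timestamp) ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)].*
  FROM `project.dataset.table`
  GROUP BY month, customer_id
)
SELECT month, customer_id,
  -- most recent non-null balance up to and including this month
  LAST_VALUE(value IGNORE NULLS) OVER(win) value,
  LAST_VALUE(timestamp IGNORE NULLS) OVER(win) known_timestamp
FROM months CROSS JOIN customers
LEFT JOIN latest_per_month USING(month, customer_id)
WINDOW win AS (PARTITION BY customer_id ORDER BY month
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
ORDER BY customer_id, month
```

With these sample rows, customer 2's August balance of 200 should appear for September and October as well, while customer 1 stays null for August because no earlier balance is known.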