How do I coalesce NULLs across multiple rows in BigQuery?

Time: 2020-01-11 13:53:17

Tags: sql google-bigquery coalesce

I have the following table:

Date       |event_number| customer_id1 | customer_age | customer_gender
10/01/2020 |     1      |   abc        |  NULL        |  NULL
10/01/2020 |     2      |   abc        |  NULL        |  male
10/01/2020 |     3      |   abc        |  45          |  NULL
10/01/2020 |     1      |   def        |  30          |  NULL 

I want to run a SQL query once a day to find new combinations of customer_id1, customer_age and customer_gender.

The output should look like this:

query_run_time | customer_id1 | customer_age | customer_gender
11/01/2020     | abc          | 45           | male
11/01/2020     | def          | 30           | NULL

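query_run_time is the date on which the query is run. If a combination (customer_id1, customer_age, customer_gender) already exists in the table, I don't want to insert that row.

Thanks

3 answers:

Answer 0: (score: 0)

You can use a window function to assign a row number within each subquery (one for non-NULL ages, one for non-NULL genders) and then join the subqueries on that row number, for example like this: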

SELECT COALESCE(a.customer_id, b.customer_id) as customer_id
     , customer_age
     , customer_gender
  FROM ( 
         SELECT customer_id, customer_age
              , ROW_NUMBER() OVER ( PARTITION BY customer_id ORDER BY customer_age ) AS row_no
           FROM customer_event
          WHERE customer_age IS NOT NULL
       ) a
  FULL JOIN ( 
         SELECT customer_id, customer_gender
              , ROW_NUMBER() OVER ( PARTITION BY customer_id ORDER BY customer_gender ) AS row_no
           FROM customer_event
          WHERE customer_gender IS NOT NULL
       ) b ON b.customer_id = a.customer_id
          AND b.row_no = a.row_no
 ORDER BY COALESCE(a.customer_id, b.customer_id)
        , COALESCE(a.row_no, b.row_no)

Schema and test data

CREATE TABLE customer_event (
  event_number      INT           NOT NULL,
  customer_id       VARCHAR(10)   NOT NULL,
  customer_age      INT,
  customer_gender   VARCHAR(10)
);
INSERT INTO customer_event VALUES
( 1, 'abc', NULL, NULL     ),
( 2, 'abc', NULL, 'male'   ),
( 3, 'abc', 45  , NULL     ),
( 4, 'abc', 50  , 'female' ),
( 5, 'abc', 27  , NULL     ),
( 1, 'def', 30  , NULL     );

Output

customer_id  customer_age  customer_gender
abc          27            female
abc          45            male
abc          50            (null)
def          30            (null)

The above comes from a test run on SQL Fiddle using PostgreSQL 9.6.

Answer 1: (score: 0)

Use aggregate functions with GROUP BY. MAX skips NULLs, so each customer's rows collapse into one:

SELECT query_run_time, customer_id,
       MAX(customer_age) AS customer_age,
       MAX(customer_gender) AS customer_gender
FROM tbl
GROUP BY query_run_time, customer_id

FIDDLE DEMO

Output

query_run_time | customer_id1 | customer_age | customer_gender
11/01/2020     | abc          | 45           | male
11/01/2020     | def          | 30           | NULL
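
Adapted to the column names in the question, the same idea might look like the sketch below in BigQuery. The table name mydataset.customer_event is a placeholder that does not appear in the question, and CURRENT_DATE() stands in for query_run_time on the assumption that it is a DATE. Keep in mind that MAX returns the largest non-NULL value, not the most recent one, which only matters when a customer has several different non-NULL values.

```sql
-- Sketch only: mydataset.customer_event is an assumed table name.
SELECT CURRENT_DATE()       AS query_run_time,   -- the day the query runs
       customer_id1,
       MAX(customer_age)    AS customer_age,     -- aggregates skip NULLs
       MAX(customer_gender) AS customer_gender
FROM mydataset.customer_event
GROUP BY customer_id1;
```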

Answer 2: (score: 0)

I suspect what you really want is the most recent value of each column. Here is one way:

-- for each column, take the last non-NULL value by event_number
select date, customer_id1,
       array_agg(customer_age ignore nulls order by event_number desc limit 1)[safe_ordinal(1)] as age,
       array_agg(customer_gender ignore nulls order by event_number desc limit 1)[safe_ordinal(1)] as gender
from t
group by date, customer_id1;
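
The question also asks to skip combinations that are already stored. One possible sketch is to compute one row per customer and subtract the rows already present in the target with EXCEPT DISTINCT, which treats NULLs as equal when comparing rows, so (def, 30, NULL) is not inserted twice. The names mydataset.t and mydataset.customer_combination are assumptions for illustration, and the target is assumed to have the four columns from the question's expected output, with query_run_time as a DATE.

```sql
-- Sketch only: both table names are assumed, not taken from the question.
INSERT INTO mydataset.customer_combination
  (query_run_time, customer_id1, customer_age, customer_gender)
SELECT CURRENT_DATE() AS query_run_time,
       customer_id1, customer_age, customer_gender
FROM (
  -- Latest non-NULL value per column for each customer.
  (SELECT customer_id1,
          ARRAY_AGG(customer_age IGNORE NULLS
                    ORDER BY event_number DESC LIMIT 1)[SAFE_ORDINAL(1)] AS customer_age,
          ARRAY_AGG(customer_gender IGNORE NULLS
                    ORDER BY event_number DESC LIMIT 1)[SAFE_ORDINAL(1)] AS customer_gender
   FROM mydataset.t
   GROUP BY customer_id1)
  EXCEPT DISTINCT
  -- Combinations stored by earlier runs; NULLs compare as equal here.
  (SELECT customer_id1, customer_age, customer_gender
   FROM mydataset.customer_combination)
);
```

Because the comparison covers all three columns, a customer whose age or gender becomes known later produces a new row instead of updating the old one.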