BigQuery: distinct count across arrays

Time: 2018-09-24 19:13:10

Tags: arrays google-bigquery

I want to concatenate arrays across rows and then do a distinct count of their elements. Ideally, this would work:

WITH test AS
(
  SELECT
  DATE('2018-01-01') as date,
  2 as value,
  [1,2,3] as key
  UNION ALL
  SELECT
  DATE('2018-01-02') as date,
  3 as value,
  [1,4,5] as key
)
SELECT
  SUM(value) as total_value,
  ARRAY_LENGTH(ARRAY_CONCAT_AGG(DISTINCT key)) as unique_key_count
FROM test

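Unfortunately, the ARRAY_CONCAT_AGG function does not support the DISTINCT operator. I can unnest the array, but then I get a fan-out, and the sum of the value column is wrong: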

WITH test AS
(
  SELECT
  DATE('2018-01-01') as date,
  2 as value,
  [1,2,3] as key
  UNION ALL
  SELECT
  DATE('2018-01-02') as date,
  3 as value,
  [1,4,5] as key
)

SELECT
  SUM(value) as total_value,
  COUNT(DISTINCT k) as unique_key_count

FROM test
  CROSS JOIN UNNEST(key) k
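
To make the fan-out concrete, here is a minimal sketch that lists the joined rows instead of aggregating them. It reuses the sample data from the question; only the ORDER BY is added for readability.

```sql
WITH test AS
(
  SELECT DATE('2018-01-01') AS date, 2 AS value, [1,2,3] AS key UNION ALL
  SELECT DATE('2018-01-02') AS date, 3 AS value, [1,4,5] AS key
)
-- Each source row is repeated once per array element,
-- so value shows up three times for each date.
SELECT date, value, k
FROM test
  CROSS JOIN UNNEST(key) AS k
ORDER BY date, k
```

This produces six rows (three per date). Summing value over them gives 2 * 3 + 3 * 3 = 15 rather than the intended 5, while the distinct count of k is still 5.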

Is there something I am missing that would let me avoid joining against the unnested array?

2 Answers:

Answer 0 (score: 3)

Here is an alternative:

-- Returns the number of distinct elements in an array of any element type.
CREATE TEMP FUNCTION DistinctCount(arr ANY TYPE) AS (
  (SELECT COUNT(DISTINCT x) FROM UNNEST(arr) AS x)
);

WITH test AS
(
  SELECT
  DATE('2018-01-01') as date,
  2 as value,
  [1,2,3] as key
  UNION ALL
  SELECT
  DATE('2018-01-02') as date,
  3 as value,
  [1,4,5] as key
)

SELECT
  SUM(value) as total_value,
  DistinctCount(ARRAY_CONCAT_AGG(key)) as unique_key_count
FROM test

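This avoids a subquery in the main query, and it avoids joining the array with the table, which is what duplicates the values in the sum.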
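
Because the function wraps an ordinary aggregate, the same pattern should also work per group. The sketch below is an illustration that goes beyond the original answer: the monthly grouping and the month alias are assumptions added here.

```sql
CREATE TEMP FUNCTION DistinctCount(arr ANY TYPE) AS (
  (SELECT COUNT(DISTINCT x) FROM UNNEST(arr) AS x)
);

WITH test AS
(
  SELECT DATE('2018-01-01') AS date, 2 AS value, [1,2,3] AS key UNION ALL
  SELECT DATE('2018-01-02') AS date, 3 AS value, [1,4,5] AS key
)
SELECT
  DATE_TRUNC(date, MONTH) AS month,  -- hypothetical grouping, not in the question
  SUM(value) AS total_value,
  -- Concatenate the arrays within each group, then count distinct elements.
  DistinctCount(ARRAY_CONCAT_AGG(key)) AS unique_key_count
FROM test
GROUP BY month
```

Both sample rows fall in January 2018, so this returns a single group. No join is involved, so value is summed once per source row.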

Answer 1 (score: 3)

Below is for BigQuery Standard SQL:

#standardSQL
WITH test AS
(
  SELECT DATE('2018-01-01') AS DATE, 2 AS value, [1,2,3] AS key UNION ALL
  SELECT DATE('2018-01-02') AS DATE, 3 AS value, [1,4,5] AS key
)
SELECT 
  total_value,
  COUNT(DISTINCT key) unique_key_count
FROM (
  SELECT
    SUM(value) AS total_value,
    ARRAY_CONCAT_AGG(key) AS all_keys
  FROM test
), UNNEST(all_keys) key
GROUP BY total_value  

Result:

Row total_value unique_key_count     
1   5           5     

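If your table has many rows, you can easily run into memory/resource problems. In that case, you can try approximate aggregation with the HyperLogLog++ functions, as in the example below: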

#standardSQL
WITH test AS
(
  SELECT DATE('2018-01-01') AS DATE, 2 AS value, [1,2,3] AS key UNION ALL
  SELECT DATE('2018-01-02') AS DATE, 3 AS value, [1,4,5] AS key
)
SELECT
  SUM(value) total_value,
  HLL_COUNT.MERGE((SELECT HLL_COUNT.INIT(key) FROM UNNEST(key) key)) AS unique_key_count
FROM test

With result:

Row total_value unique_key_count     
1   5           5

Note: this is an approximate aggregation, so pay attention to the precision parameter of the HLL_COUNT.INIT(input [, precision]) function.
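
As a sketch of how that parameter might be set, the query below passes an explicit precision to HLL_COUNT.INIT. As far as I know, BigQuery accepts values from 10 to 24 with a default of 15; the choice of 24 and the alias k are assumptions for illustration, not part of the original answer.

```sql
#standardSQL
WITH test AS
(
  SELECT DATE('2018-01-01') AS date, 2 AS value, [1,2,3] AS key UNION ALL
  SELECT DATE('2018-01-02') AS date, 3 AS value, [1,4,5] AS key
)
SELECT
  SUM(value) AS total_value,
  -- Build one sketch per row from its array, then merge the sketches across rows.
  -- A precision of 24 trades more memory per sketch for a tighter estimate.
  HLL_COUNT.MERGE((SELECT HLL_COUNT.INIT(k, 24) FROM UNNEST(key) AS k)) AS unique_key_count
FROM test
```

Higher precision makes each sketch larger, so it is worth checking whether the default is already accurate enough before raising it.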