Simple distinct count across BigQuery arrays without using HLL or UDF

Date: 2018-12-06 14:23:02

Tags: arrays google-bigquery distinct-values

Like the example here, I want to count distinct values across BigQuery arrays: Distinct Count across Bigquery arrays

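However, I have a few additional requirements that make the solutions provided in that post not viable for me:

  • The solution must not use a UDF (too slow)
  • The solution must not use the HLL functions (the count must be exact)
  • The solution must not use the SELECT-within-a-SELECT pattern shown in the linked solution, because it needs to aggregate over a flexible set of dimensions in the final SELECT, chosen by the user in a BI tool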

So, while this extended example (which includes User as a grouping dimension) works using HLL:

#standardSQL
WITH
  test AS (
  SELECT
    'A' AS User, DATE('2018-01-01') AS ReportDate, 2 AS value, [1,2,3] AS key
  UNION ALL
  SELECT
    'A' AS User, DATE('2018-01-02') AS ReportDate, 3 AS value, [1,4,5] AS key
  UNION ALL
  SELECT
    'B' AS User, DATE('2018-01-02') AS ReportDate, 4 AS value, [4,5,6,7,8] AS key
  UNION ALL
  SELECT
    'B' AS User, DATE('2018-01-02') AS ReportDate, 5 AS value, [3,4,5,6,7] AS key )
SELECT
  User,
  SUM(value) total_value,
  HLL_COUNT.MERGE((
    SELECT
      HLL_COUNT.INIT(key)
    FROM
      UNNEST(key) key)) AS unique_key_count
FROM
  test
GROUP BY
  user

I need a version of the distinct count aggregated across the arrays that meets the requirements above.

Again, this means it should also work if I group by ReportDate only, by the User / ReportDate combination, or if the example is extended with additional dimensions.

1 answer:

Answer 0: (score: 1)

#standardSQL
WITH test AS
(
  SELECT 'A' AS User, DATE('2018-01-01') AS ReportDate, 2 AS value, [1,2,3] AS key UNION ALL
  SELECT 'A' AS User, DATE('2018-01-02') AS ReportDate, 3 AS value, [1,4,5] AS key UNION ALL
  SELECT 'B' AS User, DATE('2018-01-02') AS ReportDate, 4 AS value, [4,5,6,7,8] AS key UNION ALL
  SELECT 'B' AS User, DATE('2018-01-02') AS ReportDate, 5 AS value, [3,4,5,6,7] AS key  
)
SELECT 
  User,
  SUM(IF(flag=0, value, 0)) total_value,
  COUNT(DISTINCT key) unique_key_count
FROM test, UNNEST(key) key WITH OFFSET flag
GROUP BY User   

with result

Row User    total_value unique_key_count     
1   A       5           5    
2   B       9           6
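
The question also asks that the approach keep working when the grouping dimension changes. As a minimal sketch of that, here is the same query grouped by ReportDate instead of User, with nothing else changed. The comments are an interpretation of why the query works, not something stated in the answer: WITH OFFSET numbers the elements of each array from 0, so `flag = 0` marks exactly one unnested row per source row.

```sql
#standardSQL
-- Same sample data as above; only the grouping column changes.
WITH test AS
(
  SELECT 'A' AS User, DATE('2018-01-01') AS ReportDate, 2 AS value, [1,2,3] AS key UNION ALL
  SELECT 'A' AS User, DATE('2018-01-02') AS ReportDate, 3 AS value, [1,4,5] AS key UNION ALL
  SELECT 'B' AS User, DATE('2018-01-02') AS ReportDate, 4 AS value, [4,5,6,7,8] AS key UNION ALL
  SELECT 'B' AS User, DATE('2018-01-02') AS ReportDate, 5 AS value, [3,4,5,6,7] AS key
)
SELECT
  ReportDate,
  -- Unnesting repeats each source row once per array element.
  -- Summing value only where flag = 0 counts it once per source row.
  SUM(IF(flag = 0, value, 0)) AS total_value,
  -- Exact distinct count over all unnested elements in the group.
  COUNT(DISTINCT key) AS unique_key_count
FROM test, UNNEST(key) key WITH OFFSET flag
GROUP BY ReportDate
ORDER BY ReportDate
```

Working through the sample rows by hand, this should give total_value 2 and unique_key_count 3 for 2018-01-01, and total_value 12 and unique_key_count 7 for 2018-01-02. One caveat to keep in mind: the comma join with UNNEST drops source rows whose key array is empty or NULL, so their value would not be summed. A LEFT JOIN UNNEST would probably be needed to keep them, and the flag test would then have to handle a NULL offset.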