Get the top items for each column in a single query in BigQuery

Time: 2019-02-27 02:22:55

Tags: sql google-bigquery

Suppose I have the following two fields:

`name`     `age`
"tom"      20
"tom"      20
"brad"     10
"steve"    14
"alex"     13
"alex"     11

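I want to populate a filter panel on my page that shows the most frequent values of each field along with their counts. For example, it would look like: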

name (top 2)
----------------
Alex (2)
Tom (2)

age (top 2)
----------------
20 (2)
10 (1)

Normally I would do this with two queries:

SELECT name, count(*) FROM mytable GROUP BY name ORDER BY count(*) DESC LIMIT 2;
SELECT age, count(*) FROM mytable GROUP BY age ORDER BY count(*) DESC LIMIT 2

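However, in practice there may be hundreds of columns, so I don't want to run 100 queries just to load the filter panel. Is there a way to do the above in a single query? The results must be exact, so I can't use something like APPROX_TOP_COUNT (unless you can specify 100% accuracy).

How would I construct such a query?

Perhaps the following query would work, but how can I make sure the values and counts are correct?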

select APPROX_TOP_COUNT(name, 2), APPROX_TOP_COUNT(age, 2) from `mytable`

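I need exact results because there may be financial data here. For example, I need to show an exact "units sold" or similar figure in the side panel.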

1 answer:

Answer 0 (score: 1)

Below is for BigQuery Standard SQL

#standardSQL
SELECT
  ARRAY(SELECT REGEXP_REPLACE(name, r'\(0*', '(') FROM t.names name ORDER BY name DESC) names,
  ARRAY(SELECT REGEXP_REPLACE(age, r'\(0*', '(') FROM t.ages age ORDER BY age DESC) ages
FROM (
  SELECT 
    ARRAY_AGG(DISTINCT name ORDER BY name DESC LIMIT 2) names,
    ARRAY_AGG(DISTINCT age ORDER BY age DESC LIMIT 2) ages
  FROM (
    SELECT 
      CONCAT('(', SUBSTR(CONCAT('00000', CAST(COUNT(1) OVER(PARTITION BY name) AS STRING)), -5), ') ', name) name,
      CONCAT('(', SUBSTR(CONCAT('00000', CAST(COUNT(1) OVER(PARTITION BY age) AS STRING)), -5), ') ', CAST(age AS STRING)) age
    FROM `project.dataset.table`
  )
) t

You can test the above using the sample data from your question, as in the example below

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'tom' name, 20 age UNION ALL
  SELECT 'tom', 20 UNION ALL
  SELECT 'brad', 10 UNION ALL
  SELECT 'steve', 14 UNION ALL
  SELECT 'alex', 13 UNION ALL
  SELECT 'alex', 11 
)
SELECT
  ARRAY(SELECT REGEXP_REPLACE(name, r'\(0*', '(') FROM t.names name ORDER BY name DESC) names,
  ARRAY(SELECT REGEXP_REPLACE(age, r'\(0*', '(') FROM t.ages age ORDER BY age DESC) ages
FROM (
  SELECT 
    ARRAY_AGG(DISTINCT name ORDER BY name DESC LIMIT 2) names,
    ARRAY_AGG(DISTINCT age ORDER BY age DESC LIMIT 2) ages
  FROM (
    SELECT 
      CONCAT('(', SUBSTR(CONCAT('00000', CAST(COUNT(1) OVER(PARTITION BY name) AS STRING)), -5), ') ', name) name,
      CONCAT('(', SUBSTR(CONCAT('00000', CAST(COUNT(1) OVER(PARTITION BY age) AS STRING)), -5), ') ', CAST(age AS STRING)) age
    FROM `project.dataset.table`
  )
) t

with result

Row     names       ages     
1       (2) tom     (2) 20   
        (2) alex    (1) 14   
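
A note on how the query above arrives at this result. The inner SELECT encodes each value as a string such as '(00002) tom', with the exact window count left-padded with zeros, so that ordering the strings in descending order ranks them by count first. ARRAY_AGG(DISTINCT ... ORDER BY ... DESC LIMIT 2) then keeps the two largest strings, and the outer SELECT strips the padding for display. The sketch below illustrates those steps; the literal numbers stand in for COUNT(1) OVER(...) and are assumptions made for illustration only.

```sql
#standardSQL
-- Illustration only: literal counts stand in for COUNT(1) OVER(PARTITION BY ...).
SELECT
  -- Left-pad to 5 digits so that string order matches numeric order.
  SUBSTR(CONCAT('00000', CAST(2 AS STRING)), -5) AS padded_2,         -- '00002'
  SUBSTR(CONCAT('00000', CAST(10 AS STRING)), -5) AS padded_10,       -- '00010'
  -- Without padding, string comparison would rank '10' below '2'.
  CAST(10 AS STRING) < CAST(2 AS STRING) AS unpadded_misorders,       -- true
  -- A 6-digit count loses its leading digit in a 5-character window.
  SUBSTR(CONCAT('00000', CAST(123456 AS STRING)), -5) AS truncated,   -- '23456'
  -- Stripping the padding for display, as the outer SELECT does.
  REGEXP_REPLACE('(00002) tom', r'\(0*', '(') AS display_value        -- '(2) tom'
```

As the `truncated` column suggests, the 5-character window holds counts only up to 99999. If any value can occur 100,000 times or more, the '00000' prefix and the -5 offset would need to be widened together, otherwise neither the ranking nor the displayed count would stay exact.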
  

Update: "I'd like to have it as an array (exactly as it would be in select APPROX_TOP_COUNT(name, 2), APPROX_TOP_COUNT(age, 2) from mytable)"

See below; only two lines in the outer SELECT were changed

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'tom' name, 20 age UNION ALL
  SELECT 'tom', 20 UNION ALL
  SELECT 'brad', 10 UNION ALL
  SELECT 'steve', 14 UNION ALL
  SELECT 'alex', 13 UNION ALL
  SELECT 'alex', 11 
)
SELECT
  ARRAY(SELECT STRUCT(REGEXP_EXTRACT(name, r'\(\d*\) (.*)') AS value, CAST(REGEXP_EXTRACT(name, r'\((\d*)\)') AS INT64) AS `count`) FROM t.names name ORDER BY name DESC) names,
  ARRAY(SELECT STRUCT(REGEXP_EXTRACT(age, r'\(\d*\) (.*)') AS value, CAST(REGEXP_EXTRACT(age, r'\((\d*)\)') AS INT64) AS `count`) FROM t.ages age ORDER BY age DESC) ages
FROM (
  SELECT 
    ARRAY_AGG(DISTINCT name ORDER BY name DESC LIMIT 2) names,
    ARRAY_AGG(DISTINCT age ORDER BY age DESC LIMIT 2) ages
  FROM (
    SELECT 
      CONCAT('(', SUBSTR(CONCAT('00000', CAST(COUNT(1) OVER(PARTITION BY name) AS STRING)), -5), ') ', name) name,
      CONCAT('(', SUBSTR(CONCAT('00000', CAST(COUNT(1) OVER(PARTITION BY age) AS STRING)), -5), ') ', CAST(age AS STRING)) age
    FROM `project.dataset.table`
  )
) t

with result

Row names.value names.count ages.value  ages.count   
1   tom         2           20          2    
    alex        2           14          1
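
For comparison, here is a minimal sketch of a different way to get the same shape of result without encoding counts into strings. It is not the approach used in the answer above: it runs one ARRAY subquery per column, each with its own GROUP BY, inside a single statement. The tie-break on `value DESC` is an assumption chosen so that ties resolve the same way as in the results shown above, and I have not measured how this compares in cost or speed when there are hundreds of columns.

```sql
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'tom' name, 20 age UNION ALL
  SELECT 'tom', 20 UNION ALL
  SELECT 'brad', 10 UNION ALL
  SELECT 'steve', 14 UNION ALL
  SELECT 'alex', 13 UNION ALL
  SELECT 'alex', 11
)
SELECT
  -- Exact top 2 names: group, count, sort by count, keep two rows.
  ARRAY(
    SELECT AS STRUCT name AS value, COUNT(1) AS `count`
    FROM `project.dataset.table`
    GROUP BY name
    ORDER BY `count` DESC, value DESC
    LIMIT 2
  ) AS names,
  -- Same pattern for age; the value keeps its INT64 type.
  ARRAY(
    SELECT AS STRUCT age AS value, COUNT(1) AS `count`
    FROM `project.dataset.table`
    GROUP BY age
    ORDER BY `count` DESC, value DESC
    LIMIT 2
  ) AS ages
```

On the sample data this should return tom (2) and alex (2) for names, and 20 (2) and 14 (1) for ages. Because the counts stay numeric throughout, there is no padding width to maintain, at the price of one subquery per column.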