Aggregating consecutive values in BigQuery

Date: 2019-01-07 15:18:31

Tags: sql google-bigquery

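I need to aggregate consecutive values in a table using BigQuery, as shown in the example. segment can only be 'A' or 'B'. value is a String.
Basically, for each id, I need to consider only the rows with segment = 'A', taking the gaps between them into account. Rows should be ordered by date_column ASC.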

Example

id, segment, value, date_column
1, A, 3, daytime
1, A, 2, daytime
1, A, x, daytime
1, B, 3, daytime
1, B, 3, daytime
1, B, 3, daytime
1, A, 7, daytime
1, A, 3, daytime
1, B, 3, daytime
1, A, 9, daytime
1, A, 9, daytime
2, A, 3, daytime
2, B, 3, daytime
2, A, 3, daytime
2, A, m, daytime

Expected result

id, agg_values_A_segment
1, ['32x', '73', '99']
2, ['3', '3m']

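How can I get this result? I am struggling with the "gaps" between the segments.

2 answers:

Answer 0: (score: 2)

Below are options for BigQuery Standard SQL

Option 1 - using window analytic functions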

#standardSQL
SELECT id, ARRAY_AGG(values_in_group ORDER BY grp) agg_values_A_segment
FROM (
  SELECT id, grp, STRING_AGG(value, '' ORDER BY date_column) values_in_group
  FROM (
    SELECT id, segment, value, date_column, flag, 
      COUNTIF(flag) OVER(PARTITION BY id ORDER BY date_column) grp
    FROM (
      SELECT *, IFNULL(LAG(segment) OVER(PARTITION BY id ORDER BY date_column), segment) != segment flag
      FROM `project.dataset.table`
    )
  )
  WHERE segment = 'A'
  GROUP BY id, grp
)
GROUP BY id   
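
To make the two window steps easier to follow, here is a minimal plain-Python sketch of the same logic (not BigQuery code). It assumes the rows are already sorted by id and date_column; the function name aggregate_runs and the tuple layout are made up for illustration. flag marks a row whose segment differs from the previous row's segment, and grp is the running count of those flags within each id.

```python
from itertools import groupby

# Sample rows from the question as (id, segment, value),
# assumed to be already sorted by id and then date_column.
rows = [
    (1, "A", "3"), (1, "A", "2"), (1, "A", "x"),
    (1, "B", "3"), (1, "B", "3"), (1, "B", "3"),
    (1, "A", "7"), (1, "A", "3"),
    (1, "B", "3"),
    (1, "A", "9"), (1, "A", "9"),
    (2, "A", "3"), (2, "B", "3"), (2, "A", "3"), (2, "A", "m"),
]


def aggregate_runs(rows, wanted="A"):
    """Hypothetical helper mirroring the nested query step by step."""
    result = {}
    for id_, id_rows in groupby(rows, key=lambda r: r[0]):  # PARTITION BY id
        grp = 0      # running COUNTIF(flag) within the partition
        prev = None  # LAG(segment); None on the first row of the partition
        runs = {}    # grp -> concatenated values (STRING_AGG per group)
        for _, segment, value in id_rows:
            # IFNULL(LAG(segment), segment) != segment
            flag = (prev if prev is not None else segment) != segment
            grp += flag
            prev = segment
            if segment == wanted:  # WHERE segment = 'A'
                runs[grp] = runs.get(grp, "") + value
        if runs:
            # ARRAY_AGG(values_in_group ORDER BY grp)
            result[id_] = [runs[g] for g in sorted(runs)]
    return result


print(aggregate_runs(rows))
```

Every switch between 'A' and 'B' bumps grp, so each run of consecutive 'A' rows ends up with its own group number, and grouping by (id, grp) keeps the runs apart.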

You can test and play with the above using the sample data from your question, as in the example below:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id, 'A' segment, '3' value, DATETIME '2019-01-07T18:46:21' date_column UNION ALL
  SELECT 1, 'A', '2', '2019-01-07T18:46:22' UNION ALL
  SELECT 1, 'A', 'x', '2019-01-07T18:46:23' UNION ALL
  SELECT 1, 'B', '3', '2019-01-07T18:46:24' UNION ALL
  SELECT 1, 'B', '3', '2019-01-07T18:46:25' UNION ALL
  SELECT 1, 'B', '3', '2019-01-07T18:46:26' UNION ALL
  SELECT 1, 'A', '7', '2019-01-07T18:46:27' UNION ALL
  SELECT 1, 'A', '3', '2019-01-07T18:46:28' UNION ALL
  SELECT 1, 'B', '3', '2019-01-07T18:46:29' UNION ALL
  SELECT 1, 'A', '9', '2019-01-07T18:46:30' UNION ALL
  SELECT 1, 'A', '9', '2019-01-07T18:46:31' UNION ALL
  SELECT 2, 'A', '3', '2019-01-07T18:46:32' UNION ALL
  SELECT 2, 'B', '3', '2019-01-07T18:46:33' UNION ALL
  SELECT 2, 'A', '3', '2019-01-07T18:46:34' UNION ALL
  SELECT 2, 'A', 'm', '2019-01-07T18:46:35' 
)
SELECT id, ARRAY_AGG(values_in_group ORDER BY grp) agg_values_A_segment
FROM (
  SELECT id, grp, STRING_AGG(value, '' ORDER BY date_column) values_in_group
  FROM (
    SELECT id, segment, value, date_column, flag, 
      COUNTIF(flag) OVER(PARTITION BY id ORDER BY date_column) grp
    FROM (
      SELECT *, IFNULL(LAG(segment) OVER(PARTITION BY id ORDER BY date_column), segment) != segment flag
      FROM `project.dataset.table`
    )
  )
  WHERE segment = 'A'
  GROUP BY id, grp
)
GROUP BY id
-- ORDER BY id   

with result

Row id  agg_values_A_segment     
1   1   32x  
        73   
        99   
2   2   3    
        3m     

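Option 2 - the option above should work even with a large number of rows per id, but it looks a bit heavy. This second option is simpler, but it assumes there is some character or sequence of characters that you are sure will never result from concatenating your values, such as a pipe character or a tab. In the example below I chose the word 'delimiter', assuming it will not appear as a result of concatenation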

#standardSQL
SELECT id,
  ARRAY(SELECT part FROM UNNEST(parts) part WHERE part != '') agg_values_A_segment 
FROM (
  SELECT id, 
    SPLIT(STRING_AGG(IF(segment = 'A', value, 'delimiter'), '' ORDER BY date_column), 'delimiter') parts
  FROM `project.dataset.table`
  GROUP BY id
)
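
The idea behind this option can be traced in a few lines of plain Python (a sketch, not BigQuery code). It assumes the rows are already sorted by id and date_column; the names DELIMITER and split_on_delimiter are made up for illustration. In SQL, the order of concatenation is only guaranteed when STRING_AGG has its own ORDER BY clause.

```python
from itertools import groupby

# Sample rows from the question as (id, segment, value),
# assumed to be already sorted by id and then date_column.
rows = [
    (1, "A", "3"), (1, "A", "2"), (1, "A", "x"),
    (1, "B", "3"), (1, "B", "3"), (1, "B", "3"),
    (1, "A", "7"), (1, "A", "3"),
    (1, "B", "3"),
    (1, "A", "9"), (1, "A", "9"),
    (2, "A", "3"), (2, "B", "3"), (2, "A", "3"), (2, "A", "m"),
]

# Assumption: this marker never shows up inside the concatenated values.
DELIMITER = "delimiter"


def split_on_delimiter(rows, wanted="A"):
    """Hypothetical helper mirroring the STRING_AGG + SPLIT query."""
    result = {}
    for id_, id_rows in groupby(rows, key=lambda r: r[0]):  # GROUP BY id
        # STRING_AGG(IF(segment = 'A', value, 'delimiter'), '')
        joined = "".join(
            value if segment == wanted else DELIMITER
            for _, segment, value in id_rows
        )
        # SPLIT(..., 'delimiter'), then drop the empty parts that
        # appear between consecutive non-'A' rows.
        result[id_] = [part for part in joined.split(DELIMITER) if part != ""]
    return result


print(split_on_delimiter(rows))
```

Each 'B' row contributes one marker, so a run of several 'B' rows produces empty strings after the split. That is why the outer query filters on part != ''.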

You can test the above with the same sample data:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id, 'A' segment, '3' value, DATETIME '2019-01-07T18:46:21' date_column UNION ALL
  SELECT 1, 'A', '2', '2019-01-07T18:46:22' UNION ALL
  SELECT 1, 'A', 'x', '2019-01-07T18:46:23' UNION ALL
  SELECT 1, 'B', '3', '2019-01-07T18:46:24' UNION ALL
  SELECT 1, 'B', '3', '2019-01-07T18:46:25' UNION ALL
  SELECT 1, 'B', '3', '2019-01-07T18:46:26' UNION ALL
  SELECT 1, 'A', '7', '2019-01-07T18:46:27' UNION ALL
  SELECT 1, 'A', '3', '2019-01-07T18:46:28' UNION ALL
  SELECT 1, 'B', '3', '2019-01-07T18:46:29' UNION ALL
  SELECT 1, 'A', '9', '2019-01-07T18:46:30' UNION ALL
  SELECT 1, 'A', '9', '2019-01-07T18:46:31' UNION ALL
  SELECT 2, 'A', '3', '2019-01-07T18:46:32' UNION ALL
  SELECT 2, 'B', '3', '2019-01-07T18:46:33' UNION ALL
  SELECT 2, 'A', '3', '2019-01-07T18:46:34' UNION ALL
  SELECT 2, 'A', 'm', '2019-01-07T18:46:35' 
)
SELECT id,
  ARRAY(SELECT part FROM UNNEST(parts) part WHERE part != '') agg_values_A_segment 
FROM (
  SELECT id, 
    SPLIT(STRING_AGG(IF(segment = 'A', value, 'delimiter'), '' ORDER BY date_column), 'delimiter') parts
  FROM `project.dataset.table`
  GROUP BY id
)
-- ORDER BY id   

obviously with the same result

Row id  agg_values_A_segment     
1   1   32x  
        73   
        99   
2   2   3    
        3m      

Note: the second option can fail with a Resources exceeded error if you have too many rows per id - you just need to try it on your real data

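Answer 1: (score: 1)

SQL tables represent unordered sets. This is particularly true in parallel, columnar databases such as BigQuery. The rest of this answer assumes that you have a column that specifies the ordering of the rows.

This is a gaps-and-islands problem. You can use the difference of row_number() values to identify the adjacent groups . . . and then aggregate: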

select id, array_agg(vals order by min_ordercol)
from (select id, segment, string_agg(value, '' order by date_column) as vals,
             min(date_column) as min_ordercol
      from (select t.*,
                   -- position of the row within its id
                   row_number() over (partition by id order by date_column) as seqnum,
                   -- position of the row within its id and segment
                   row_number() over (partition by id, segment order by date_column) as seqnum_2
            from t
           ) t
      where segment = 'A'
      -- the difference is constant within a run of adjacent rows of one segment
      group by id, segment, (seqnum - seqnum_2)
     ) x
group by id;
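
To see why the difference of the two row numbers identifies the islands, here is a small plain-Python sketch of the same computation (not BigQuery code). It assumes the rows are already sorted by id and date_column; the function name islands_by_row_number is made up for illustration.

```python
from collections import defaultdict

# Sample rows from the question as (id, segment, value),
# assumed to be already sorted by id and then date_column.
rows = [
    (1, "A", "3"), (1, "A", "2"), (1, "A", "x"),
    (1, "B", "3"), (1, "B", "3"), (1, "B", "3"),
    (1, "A", "7"), (1, "A", "3"),
    (1, "B", "3"),
    (1, "A", "9"), (1, "A", "9"),
    (2, "A", "3"), (2, "B", "3"), (2, "A", "3"), (2, "A", "m"),
]


def islands_by_row_number(rows, wanted="A"):
    """Hypothetical helper mirroring the row_number() difference trick."""
    seqnum = defaultdict(int)    # row_number() over (partition by id)
    seqnum_2 = defaultdict(int)  # row_number() over (partition by id, segment)
    islands = {}                 # (id, segment, seqnum - seqnum_2) -> values
    for id_, segment, value in rows:
        seqnum[id_] += 1
        seqnum_2[(id_, segment)] += 1
        diff = seqnum[id_] - seqnum_2[(id_, segment)]
        # group by id, segment, (seqnum - seqnum_2)
        islands.setdefault((id_, segment, diff), []).append(value)

    result = defaultdict(list)
    # dicts keep insertion order, so islands come out ordered by their
    # first row, which plays the role of "order by min_ordercol".
    for (id_, segment, _), values in islands.items():
        if segment == wanted:  # keep only the 'A' islands
            result[id_].append("".join(values))
    return dict(result)


print(islands_by_row_number(rows))
```

For id 1, the three 'A' runs get differences 0, 3 and 4. Within a run both counters advance together, so the difference stays constant, and it grows whenever rows of the other segment come in between.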