Folding over a large BigQuery result

Time: 2018-08-19 17:22:19

Tags: sql google-bigquery

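Is there an easy way to do something like OCaml's fold_left over the result of a BigQuery query, with each iteration corresponding to one row of the result?

Which product or approach would be the simplest way to do this? It would be great if:

  • All I have to do is provide an initial state and a "folder" function
  • Preferably, I can write the "folder" function in a functional language
  • I don't need to install any GCP packages

Since I don't know which product or language would work, I can't be more specific, but the pseudocode would look like this: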

let my_init = []
let my_folder = fun state row ->
  // append for now, but it will be complicated. I need to do some set operations here. The point is that I need some way of transferring "state" across rows, when I iterate over rows in a predefined order.
  row.col1 :: state

let query = "SELECT col1, col2, col3 FROM table1 ORDER BY timestamp"
query |> List.fold my_folder my_init

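What I want from this simplified example is the final "state".

--- Update ---

The number of rows is unbounded: the table grows as more rows arrive. Typically there are more than a few million rows, but there could be more.

Here is a simplified example that shows the main problem I'm facing. We have a table with several columns:

  • timestamp
  • user_id: a string ID
  • operation_json: a stringified JSON object that is a list of operations, each of which is one of:
    • add user_id to a set
    • remove user_id from a set

For example, the following are valid rows: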

----------+---------+----------------------------------------------
timestamp | user_id | operation_json
----------+---------+----------------------------------------------
1         | id1     | [ { "op": "add", "set": "set1" } ]
2         | id2     | [ { "op": "add", "set": "set1" } ]
3         | id1     | [ { "op": "add", "set": "set2" } ]
4         | id3     | [ { "op": "add", "set": "set2" } ]
5         | id1     | [ { "op": "remove", "set": "set1" } ]
----------+---------+----------------------------------------------

From these rows, I would like to obtain the resulting sets of users, i.e.

set1 |-> { id2 }
set2 |-> { id1, id3 }

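I think a fold_left-like operation would be convenient here. The state is a map from set name to a set of user IDs, and the initial state is an empty map.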
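To make the intended fold concrete, here is a minimal sketch in plain JavaScript that runs outside BigQuery. It is only an illustration under assumptions: `rows` is a hypothetical in-memory copy of the example table, already sorted by timestamp, `applyRow` is a made-up name for the "folder" function, and the state is assumed to be a Map from set name to a Set of user IDs.

```javascript
// Hypothetical in-memory copy of the example rows, already ordered by timestamp.
const rows = [
  { timestamp: 1, user_id: "id1", operation_json: '[ { "op": "add", "set": "set1" } ]' },
  { timestamp: 2, user_id: "id2", operation_json: '[ { "op": "add", "set": "set1" } ]' },
  { timestamp: 3, user_id: "id1", operation_json: '[ { "op": "add", "set": "set2" } ]' },
  { timestamp: 4, user_id: "id3", operation_json: '[ { "op": "add", "set": "set2" } ]' },
  { timestamp: 5, user_id: "id1", operation_json: '[ { "op": "remove", "set": "set1" } ]' },
];

// Folder function (illustrative name): applies every operation in one row to the state.
function applyRow(state, row) {
  for (const { op, set } of JSON.parse(row.operation_json)) {
    if (!state.has(set)) state.set(set, new Set());
    if (op === "add") state.get(set).add(row.user_id);
    else if (op === "remove") state.get(set).delete(row.user_id);
  }
  return state;
}

// Array.prototype.reduce plays the role of fold_left; the initial state is an empty map.
const finalState = rows.reduce(applyRow, new Map());

for (const [set, users] of finalState) {
  console.log(`${set} |-> { ${[...users].join(", ")} }`);
}
```

On the five example rows this prints the two mappings listed above. It does not address how to stream several million rows into such a function, which is the actual question.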

2 answers:

Answer 0: (score: 3)

Below is a [quick and simple] example for BigQuery Standard SQL

#standardSQL
CREATE TEMP FUNCTION fold(arr ARRAY<INT64>, init INT64)
RETURNS FLOAT64
LANGUAGE js AS """
  const reducer = (accumulator, currentValue) => accumulator + parseInt(currentValue);
  // Start from the caller-supplied initial state instead of a hard-coded value.
  return arr.reduce(reducer, parseInt(init));
""";
WITH `project.dataset.table` AS (
  SELECT 1 id, [1, 2, 3, 4] arr, 5 initial_state UNION ALL
  SELECT 2, [1, 2, 3, 4, 5, 6, 7], 10
)
SELECT id, fold(arr, initial_state) result
FROM `project.dataset.table`

The output is

Row id  result
1   1   15.0
2   2   38.0

I think this is self-explanatory.
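For readers who want to try the reducer in isolation, here is a small JavaScript sketch that runs outside BigQuery. It assumes, as the `parseInt` call in the UDF body suggests, that integer values may reach the JavaScript code as strings. The `fold` function here is only a local stand-in, not BigQuery code.

```javascript
// Local stand-in for the UDF body: sums the array, starting from the initial state.
function fold(arr, init) {
  const reducer = (accumulator, currentValue) => accumulator + parseInt(currentValue, 10);
  return arr.reduce(reducer, parseInt(init, 10));
}

// Assumption: integer values are delivered as strings.
const total = fold(["1", "2", "3", "4"], "5");
console.log(total); // prints 15

// Without parseInt, + turns into string concatenation as soon as a string is involved.
const naive = ["1", "2", "3", "4"].reduce((acc, cur) => acc + cur, 5);
console.log(naive); // prints 51234 (a string, not a sum)
```

The second call shows why the conversion matters if values arrive as strings: the accumulator silently becomes a string instead of a number.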

See more about JS UDFs

  

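Folding a list of rows

See the extension of the above example below.
Here, you assemble an array from the rows of the result before applying the fold function. Of course, keep in mind the limits that apply to UDFs and to how large an array of rows can get.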

#standardSQL
CREATE TEMP FUNCTION fold(arr ARRAY<INT64>, init INT64)
RETURNS FLOAT64
LANGUAGE js AS """
  const reducer = (accumulator, currentValue) => accumulator + parseInt(currentValue);
  return arr.reduce(reducer, parseInt(init));
""";
WITH `project.dataset.table` AS (
  SELECT 1 id, 1 item UNION ALL
  SELECT 1, 2 UNION ALL 
  SELECT 1, 3 UNION ALL 
  SELECT 1, 4 UNION ALL 
  SELECT 2, 1 UNION ALL 
  SELECT 2, 2 UNION ALL 
  SELECT 2, 3 UNION ALL 
  SELECT 2, 4 UNION ALL 
  SELECT 2, 5 UNION ALL 
  SELECT 2, 6 UNION ALL 
  SELECT 2, 7 
)
SELECT id, fold(ARRAY_AGG(item), 5) result
FROM `project.dataset.table`  
GROUP BY id

Note that if you need more than one field from each row, you can use an ARRAY of STRUCTs, as in the example below

ARRAY_AGG(STRUCT(id , item) ORDER by id)

Of course, you then need to adjust the signature of the fold UDF accordingly

For example:

#standardSQL
CREATE TEMP FUNCTION fold(arr ARRAY<STRUCT<id INT64, item INT64>>, init INT64)
RETURNS FLOAT64
LANGUAGE js AS """
  const reducer = (accumulator, currentValue) => accumulator + parseInt(currentValue.item);
  return arr.reduce(reducer, parseInt(init));
""";
WITH `project.dataset.table` AS (
  SELECT 1 id, 1 item UNION ALL
  SELECT 1, 2 UNION ALL 
  SELECT 1, 3 UNION ALL 
  SELECT 1, 4 UNION ALL 
  SELECT 2, 1 UNION ALL 
  SELECT 2, 2 UNION ALL 
  SELECT 2, 3 UNION ALL 
  SELECT 2, 4 UNION ALL 
  SELECT 2, 5 UNION ALL 
  SELECT 2, 6 UNION ALL 
  SELECT 2, 7 
)
SELECT id, fold(ARRAY_AGG(t), 5) result
FROM `project.dataset.table` t 
GROUP BY id

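Answer 1: (score: 1)

The approach below is not about folding per se. Instead, it tries to turn the problem into a set-based one, which is more natural in SQL: identify the latest op for each user/set pair; if it is "remove", drop that user from further consideration; if it is "add", use the latest "add" for that user/set. The assumption is that there are never multiple consecutive "add" operations for the same user/set, although sequences such as add/remove/add are fine. Of course, this can be adjusted further based on the real use case.

So, with the above in mind, below is an example for BigQuery Standard SQL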

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 ts, 'id1' user_id, '[ { "op": "add", "set": "set1" } ]' operation_json UNION ALL
  SELECT 2, 'id2', '[ { "op": "add", "set": "set1" } ]' UNION ALL
  SELECT 3, 'id1', '[ { "op": "add", "set": "set2" } ]' UNION ALL
  SELECT 4, 'id3', '[ { "op": "add", "set": "set2" } ]' UNION ALL
  SELECT 5, 'id1', '[ { "op": "remove", "set": "set1" } ]' 
)
SELECT bin, STRING_AGG(user_id, ',' ORDER BY ts) result
FROM (
  SELECT user_id, bin, ARRAY_AGG(ts ORDER BY ts DESC LIMIT 1)[OFFSET(0)] ts
  FROM (
    SELECT ts, user_id, op, bin, LAST_VALUE(op) OVER(win) fin
    FROM (
      SELECT ts, user_id, 
        JSON_EXTRACT_SCALAR(REGEXP_REPLACE(operation_json, r'^\[|\]$', ''), '$.op') op, 
        JSON_EXTRACT_SCALAR(REGEXP_REPLACE(operation_json, r'^\[|\]$', ''), '$.set') bin
      FROM `project.dataset.table`
    )
    WINDOW win AS (
      PARTITION BY user_id, bin 
      ORDER BY ts 
      ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
  )
  WHERE fin = 'add'
  GROUP BY user_id, bin
)
GROUP BY bin
-- ORDER BY bin  

The output is

Row bin     result   
1   set1    id2  
2   set2    id1,id3    

If the same query is applied to the following dummy data

WITH `project.dataset.table` AS (
  SELECT 1 ts, 'id1' user_id, '[ { "op": "add", "set": "set1" } ]' operation_json UNION ALL
  SELECT 2, 'id2', '[ { "op": "add", "set": "set1" } ]' UNION ALL
  SELECT 3, 'id1', '[ { "op": "add", "set": "set2" } ]' UNION ALL
  SELECT 4, 'id3', '[ { "op": "add", "set": "set2" } ]' UNION ALL
  SELECT 5, 'id1', '[ { "op": "remove", "set": "set1" } ]' UNION ALL
  SELECT 6, 'id1', '[ { "op": "add", "set": "set1" } ]' UNION ALL
  SELECT 7, 'id1', '[ { "op": "remove", "set": "set1" } ]' UNION ALL
  SELECT 8, 'id1', '[ { "op": "add", "set": "set1" } ]' UNION ALL 
  SELECT 9, 'id1', '[ { "op": "remove", "set": "set2" } ]' UNION ALL
  SELECT 10, 'id1', '[ { "op": "add", "set": "set2" } ]'
)

the result will be

Row bin     result   
1   set1    id2,id1  
2   set2    id3,id1
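The latest-operation-wins idea can also be sketched in JavaScript as a cross-check of the last result. This is an illustration only, not a replacement for the query: `membersByLatestOp` is a hypothetical helper, `rows` is an in-memory copy of the second dummy dataset, and, like the query, it assumes exactly one operation per row.

```javascript
// In-memory copy of the second dummy dataset: [ts, user_id, op, set].
const rows = [
  [1, "id1", "add", "set1"],
  [2, "id2", "add", "set1"],
  [3, "id1", "add", "set2"],
  [4, "id3", "add", "set2"],
  [5, "id1", "remove", "set1"],
  [6, "id1", "add", "set1"],
  [7, "id1", "remove", "set1"],
  [8, "id1", "add", "set1"],
  [9, "id1", "remove", "set2"],
  [10, "id1", "add", "set2"],
].map(([ts, user_id, op, bin]) => ({ ts, user_id, op, bin }));

function membersByLatestOp(rows) {
  // Keep only the most recent operation for each (user_id, bin) pair.
  const latest = new Map();
  for (const row of rows) {
    const key = `${row.user_id}|${row.bin}`;
    const seen = latest.get(key);
    if (!seen || row.ts > seen.ts) latest.set(key, row);
  }

  // Drop pairs whose latest operation is a removal, then list users per set by ts.
  const kept = [...latest.values()]
    .filter((r) => r.op === "add")
    .sort((a, b) => a.ts - b.ts);

  const result = {};
  for (const r of kept) {
    (result[r.bin] = result[r.bin] || []).push(r.user_id);
  }
  return result;
}

console.log(membersByLatestOp(rows));
```

For this dataset the helper yields id2,id1 for set1 and id3,id1 for set2, matching the table above.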