Flattening an array-like string in BigQuery

Time: 2019-05-15 22:54:17

Tags: json google-bigquery

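I am using the BigQuery Standard SQL dialect.

I have a column that I know contains an array of JSON dictionaries.

The array length varies from row to row.

I would like to flatten it so that I can access the JSON elements of each dictionary in the array.

For example, suppose I have two records. The first has id 1, and its JSON column contains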

[
    {"key1":"val1a", "key2": "val1b"},
    {"key1":"val1c", "key2": "val1d"}
]

The second has id 2, and its JSON column contains

[{"key1":"val2a", "key2":"val2b"}]

My goal is

id | key1  | key2  | offset
---------------------------
1  | val1a | val1b |   1
1  | val1c | val1d |   2
2  | val2a | val2b |   1

(although I could live without the offset column)

Something like this looks like it should work...

WITH table AS (
SELECT 1 as id,['{"key1":"val1a", "key2": "val1b"}','{"key1":"val1c", "key2": "val1d"}'] as array_column
UNION ALL
SELECT 2 as id,['{"key1":"val2a", "key2":"val2b"}'] as array_column)

SELECT id,
    json_extract_scalar(flattened_array, '$.key1') as key1,
    json_extract_scalar(flattened_array, '$.key2') as key2
FROM table t 
CROSS JOIN UNNEST(t.array_column) AS flattened_array

Indeed, this query returns the table I expect (minus the offset column, which is easy to add).
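For completeness, here is a minimal sketch of adding the offset column to the query that already works. It assumes the 1-based numbering shown in the target table; UNNEST ... WITH OFFSET counts from 0, so 1 is added. The alias pos is only illustrative.

```sql
WITH table AS (
SELECT 1 as id,['{"key1":"val1a", "key2": "val1b"}','{"key1":"val1c", "key2": "val1d"}'] as array_column
UNION ALL
SELECT 2 as id,['{"key1":"val2a", "key2":"val2b"}'] as array_column)

SELECT id,
    json_extract_scalar(flattened_array, '$.key1') as key1,
    json_extract_scalar(flattened_array, '$.key2') as key2,
    pos + 1 as `offset`  -- WITH OFFSET is 0-based; the target table is 1-based
FROM table t
CROSS JOIN UNNEST(t.array_column) AS flattened_array WITH OFFSET AS pos
```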

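The problem is that BigQuery does not understand that this is an array of JSON-like strings. It treats the whole thing as one big string, and I don't know how to convince it otherwise. Editing my example to simulate this confusion demonstrates the problem: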

WITH table AS (
SELECT 1 as id,'[{"key1":"val1a", "key2": "val1b"},{"key1":"val1c", "key2": "val1d"}]' as array_column
UNION ALL
SELECT 2 as id,'[{"key1":"val2a", "key2":"val2b"}]' as array_column)

SELECT id,
    json_extract_scalar(flattened_array, '$.key1') as key1,
    json_extract_scalar(flattened_array, '$.key2') as key2
FROM table t 
CROSS JOIN UNNEST(t.array_column) AS flattened_array

Here the validator complains: Values referenced in UNNEST must be arrays. UNNEST contains expression of type STRING at [29:23].

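Now we are at the heart of the question. Is there an obvious way to make BigQuery understand that this string is a valid array of JSON dictionaries? Is there perhaps some JSON_* function I have overlooked that would flatten the array? Or some way to CAST this thing into an array?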

2 answers:

Answer 0 (score: 1)

You can use a BigQuery JavaScript UDF to parse the JSON in arbitrary ways:

CREATE TEMP FUNCTION flatten_array(array_column STRING)
RETURNS ARRAY<STRUCT<key1 STRING, key2 STRING>>
LANGUAGE js
AS """
  return JSON.parse(array_column)
""";

WITH table AS (
SELECT 1 as id,'[{"key1":"val1a", "key2": "val1b"},{"key1":"val1c", "key2": "val1d"}]' as array_column
UNION ALL
SELECT 2 as id,'[{"key1":"val2a", "key2":"val2b"}]' as array_column)

SELECT id,
    key1,
    key2
FROM table t 
CROSS JOIN UNNEST(flatten_array(array_column)) AS flattened_array


For better native BQ JSON array support, vote up issue 63716683 and subscribe to updates.
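BigQuery Standard SQL has since gained a native JSON_EXTRACT_ARRAY function, which returns the elements of a JSON array as an ARRAY of JSON-formatted strings. The following is only a sketch, assuming that function is available in your environment. It reuses the sample data from the question, and the alias pos is illustrative.

```sql
WITH table AS (
SELECT 1 as id,'[{"key1":"val1a", "key2": "val1b"},{"key1":"val1c", "key2": "val1d"}]' as array_column
UNION ALL
SELECT 2 as id,'[{"key1":"val2a", "key2":"val2b"}]' as array_column)

SELECT id,
    json_extract_scalar(flattened_array, '$.key1') as key1,
    json_extract_scalar(flattened_array, '$.key2') as key2,
    pos + 1 as `offset`  -- WITH OFFSET is 0-based; add 1 for 1-based numbering
FROM table t
-- JSON_EXTRACT_ARRAY turns the JSON string into ARRAY<STRING>, one element per dictionary
CROSS JOIN UNNEST(JSON_EXTRACT_ARRAY(t.array_column)) AS flattened_array WITH OFFSET AS pos
```

With this approach no UDF is needed, and each array element stays a JSON string that json_extract_scalar can read, just as in the array version of the question's query.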

Answer 1 (score: 1)

The following is for BigQuery Standard SQL. I recommend it if your JSON is as simple as in your example:

#standardSQL
SELECT id, key1, key2
FROM table,
UNNEST(REGEXP_EXTRACT_ALL(array_column, r'"key1"\s*:\s*"(.*?)"')) key1 WITH OFFSET
JOIN UNNEST(REGEXP_EXTRACT_ALL(array_column, r'"key2"\s*:\s*"(.*?)"')) key2 WITH OFFSET
USING (OFFSET)

You can test and play with the above using the sample data from your question:

#standardSQL
WITH table AS (
  SELECT 1 AS id,'[{"key1":"val1a", "key2": "val1b"},{"key1":"val1c", "key2": "val1d"}]' AS array_column UNION ALL
  SELECT 2 AS id,'[{"key1":"val2a", "key2":"val2b"}]' AS array_column
)
SELECT id, key1, key2
FROM table,
UNNEST(REGEXP_EXTRACT_ALL(array_column, r'"key1"\s*:\s*"(.*?)"')) key1 WITH OFFSET
JOIN UNNEST(REGEXP_EXTRACT_ALL(array_column, r'"key2"\s*:\s*"(.*?)"')) key2 WITH OFFSET
USING (OFFSET)

This returns one row per array element: (1, val1a, val1b), (1, val1c, val1d) and (2, val2a, val2b).

I'm not 100% sure, but I feel the code above is cheaper than a UDF, which is still a good option :o), especially for the generic case of more complex JSON.