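I am using the BigQuery standard SQL dialect.
I have a column that I know contains an array of JSON dictionaries.
The array length varies from row to row.
I want to flatten it so that I can access the JSON elements of each dictionary in the array.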
For example, suppose I have two records. The first has id 1, and its JSON column contains:
[
{"key1":"val1a", "key2": "val1b"},
{"key1":"val1c", "key2": "val1d"}
]
The second has id 2, and its JSON column contains:
[{"key1":"val2a", "key2":"val2b"}]
My goal is:
id | key1 | key2 | offset
---------------------------
1 | val1a | val1b | 1
1 | val1c | val1d | 2
2 | val2a | val2b | 1
(although I can live without the offset column)
It looks like something like this should work...
WITH table AS (
SELECT 1 as id,['{"key1":"val1a", "key2": "val1b"}','{"key1":"val1c", "key2": "val1d"}'] as array_column
UNION ALL
SELECT 2 as id,['{"key1":"val2a", "key2":"val2b"}'] as array_column)
SELECT id,
json_extract_scalar(flattened_array, '$.key1') as key1,
json_extract_scalar(flattened_array, '$.key2') as key2
FROM table t
CROSS JOIN UNNEST(t.array_column) AS flattened_array
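Indeed, that query returns the table I expect (minus the offset column, which is easy to add).
The problem is that BigQuery does not understand that my column is an array of JSON-like strings. It treats the whole thing as one big string, and I don't know how to convince it otherwise. Editing my example to mimic this type confusion demonstrates the problem: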
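For reference, a minimal sketch of how the offset column could be added to the query above. The aliases pos and item_number are names I made up for illustration. WITH OFFSET counts from zero, so 1 is added to match the numbering in the target table:

```sql
WITH table AS (
  SELECT 1 AS id, ['{"key1":"val1a", "key2": "val1b"}', '{"key1":"val1c", "key2": "val1d"}'] AS array_column
  UNION ALL
  SELECT 2 AS id, ['{"key1":"val2a", "key2":"val2b"}'] AS array_column)
SELECT id,
  JSON_EXTRACT_SCALAR(flattened_array, '$.key1') AS key1,
  JSON_EXTRACT_SCALAR(flattened_array, '$.key2') AS key2,
  pos + 1 AS item_number  -- WITH OFFSET is zero-based; add 1 for one-based numbering
FROM table t
CROSS JOIN UNNEST(t.array_column) AS flattened_array WITH OFFSET AS pos
ORDER BY id, item_number
```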
WITH table AS (
SELECT 1 as id,'[{"key1":"val1a", "key2": "val1b"},{"key1":"val1c", "key2": "val1d"}]' as array_column
UNION ALL
SELECT 2 as id,'[{"key1":"val2a", "key2":"val2b"}]' as array_column)
SELECT id,
json_extract_scalar(flattened_array, '$.key1') as key1,
json_extract_scalar(flattened_array, '$.key2') as key2
FROM table t
CROSS JOIN UNNEST(t.array_column) AS flattened_array
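Here the validator complains: "Values referenced in UNNEST must be arrays. UNNEST contains expression of type STRING at [29:23]".
Now we are at the heart of the question. Is there an obvious way to make BigQuery understand that this string is a valid array of JSON dictionaries? Perhaps some JSON_* function I have overlooked will flatten the array? Or is there some way to CAST this thing into an array?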
Answer 0: (score: 1)
You can use a BigQuery JavaScript UDF to parse the JSON in any way you like:
CREATE TEMP FUNCTION flatten_array(array_column STRING)
RETURNS ARRAY<STRUCT<key1 STRING, key2 STRING>>
LANGUAGE js
AS """
return JSON.parse(array_column)
""";
WITH table AS (
SELECT 1 as id,'[{"key1":"val1a", "key2": "val1b"},{"key1":"val1c", "key2": "val1d"}]' as array_column
UNION ALL
SELECT 2 as id,'[{"key1":"val2a", "key2":"val2b"}]' as array_column)
SELECT id,
key1,
key2
FROM table t
CROSS JOIN UNNEST(flatten_array(array_column)) AS flattened_array
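The UDF above hard-codes key1 and key2 in its return type. If the keys vary, one possible variation (a sketch only; json_array_to_strings and elem are names I made up) is to return each array element as its own JSON string and keep using JSON_EXTRACT_SCALAR on the SQL side:

```sql
CREATE TEMP FUNCTION json_array_to_strings(json_text STRING)
RETURNS ARRAY<STRING>
LANGUAGE js
AS """
  // Parse the whole array, then serialize every element back to a JSON string.
  return JSON.parse(json_text).map(function (element) {
    return JSON.stringify(element);
  });
""";
WITH table AS (
  SELECT 1 AS id, '[{"key1":"val1a", "key2": "val1b"},{"key1":"val1c", "key2": "val1d"}]' AS array_column
  UNION ALL
  SELECT 2 AS id, '[{"key1":"val2a", "key2":"val2b"}]' AS array_column)
SELECT id,
  JSON_EXTRACT_SCALAR(elem, '$.key1') AS key1,
  JSON_EXTRACT_SCALAR(elem, '$.key2') AS key2
FROM table t
CROSS JOIN UNNEST(json_array_to_strings(t.array_column)) AS elem
```

This assumes every value in array_column is valid JSON with an array at the top level; JSON.parse throws on malformed input, which would fail the query.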
For better native BQ JSON array support, vote issue 63716683 up and subscribe to updates.
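As far as I know, BigQuery standard SQL has since gained a JSON_EXTRACT_ARRAY function that takes a JSON string and returns an ARRAY<STRING> with one JSON-formatted string per element. A minimal sketch, assuming that function is available in your environment (elem is just an alias I chose):

```sql
WITH table AS (
  SELECT 1 AS id, '[{"key1":"val1a", "key2": "val1b"},{"key1":"val1c", "key2": "val1d"}]' AS array_column
  UNION ALL
  SELECT 2 AS id, '[{"key1":"val2a", "key2":"val2b"}]' AS array_column)
SELECT id,
  JSON_EXTRACT_SCALAR(elem, '$.key1') AS key1,
  JSON_EXTRACT_SCALAR(elem, '$.key2') AS key2
FROM table t
-- JSON_EXTRACT_ARRAY turns the STRING into ARRAY<STRING>, which UNNEST accepts
CROSS JOIN UNNEST(JSON_EXTRACT_ARRAY(t.array_column)) AS elem
```

With this, no UDF is needed, and the rest of the query is the same as in the question.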
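Answer 1: (score: 1)
The following works in BigQuery Standard SQL, and I would recommend it if your JSON is as simple as in your example.
You can test and play with it using the sample data from your question; the query is shown first on its own and then with that data inlined.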
#standardSQL
SELECT id, key1, key2
FROM table,
UNNEST(REGEXP_EXTRACT_ALL(array_column, r'"key1"\s*:\s*"(.*?)"')) key1 WITH OFFSET
JOIN UNNEST(REGEXP_EXTRACT_ALL(array_column, r'"key2"\s*:\s*"(.*?)"')) key2 WITH OFFSET
USING (OFFSET)
The same query with the sample data from the question inlined:
#standardSQL
WITH table AS (
SELECT 1 AS id,'[{"key1":"val1a", "key2": "val1b"},{"key1":"val1c", "key2": "val1d"}]' AS array_column UNION ALL
SELECT 2 AS id,'[{"key1":"val2a", "key2":"val2b"}]' AS array_column
)
SELECT id, key1, key2
FROM table,
UNNEST(REGEXP_EXTRACT_ALL(array_column, r'"key1"\s*:\s*"(.*?)"')) key1 WITH OFFSET
JOIN UNNEST(REGEXP_EXTRACT_ALL(array_column, r'"key2"\s*:\s*"(.*?)"')) key2 WITH OFFSET
USING (OFFSET)
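I am not 100% sure, but I feel the code above is cheaper than a UDF. The UDF is still a good option :o), especially for the generic case of more complex JSON.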