我有一个BigQuery表,其中包含一个包含JSON的列。
我想输出每个键出现在列中的次数,然后按降序排序。与所有键关联的值为1
。
每个对象有一个已知/有限数量的键,但是如果看到的最大对象发生变化,我宁愿不依赖它。
整体上有一个已知/有限数量的密钥,但我不想依赖于枚举/更新列表。
e.g。输入:三行,一列名为“json”
[
{"json": "{'A': 1}"},
{"json": "{'B': 1}"},
{"json": "{'B': 1, 'C': 1}"}
]
e.g。输出:三行,两列名为“key”和“count”
[
{"key": "B", "count": 2},
{"key": "A", "count": 1},
{"key": "C", "count": 1}
]
这是最简单的方法,因为我不想依赖每个对象和整体的有限数量的键?
答案 0 :(得分:3)
下面是BigQuery Standard SQL
请参阅Enabling Standard SQL和User-Defined Functions
CREATE TEMPORARY FUNCTION parseJson(y STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
var z = new Array();
processKey(JSON.parse(y), '');
function processKey(node, parent) {
Object.keys(node).map(function(key) {
value = node[key].toString();
if (value !== '[object Object]') {
z.push(key)
} else {
if (parent !== '' && parent.substr(parent.length-1) !== '.') {parent += '.'};
processKey(node[key], parent + key);
};
});
};
return z
""";
WITH theTable AS (
SELECT '{"json":{"A":"1"}}' AS json UNION ALL
SELECT '{"json":{"B":"1"}}' AS json UNION ALL
SELECT '{"json":{"B":"1","C":"1"}}' AS json
)
SELECT key, COUNT(1) AS `count`
FROM theTable, UNNEST(parseJson(json)) AS key
GROUP BY key
ORDER BY 2 DESC
输出:
key count
B 2
A 1
C 1
注意:parseJson UDF足够通用,可以处理任何json,所以你可以尝试上面的代码使用下面的输入,它仍然可以工作:
WITH theTable AS (
SELECT '{"json":{"A":"1"}}' AS json UNION ALL
SELECT '{"json":{"B":"1"}}' AS json UNION ALL
SELECT '{"json":{"B":"1","C":"1"}}' AS json UNION ALL
SELECT '{"A":"1"}' AS json UNION ALL
SELECT '{"B":"1"}' AS json UNION ALL
SELECT '{"B":"1","C":"1"}' AS json
)
输出:
key count
B 4
A 2
C 2
为BigQuery Legacy SQL添加了版本
为了简化本文的介绍和进一步测试 - 我在这里使用Legacy SQL UDF的inline version
。 Legacy SQL中的Inline version
不受官方支持 - 因此如果它适用于您 - 您需要对其进行轻微转换 - 有关BigQuery Legacy SQL中UDF的详细信息,请参阅BigQuery User-Defined Functions
SELECT key, COUNT(1) as cnt
FROM JS((
SELECT json FROM
(SELECT '{"json":{"A":"1"}}' AS json),
(SELECT '{"json":{"B":"1"}}' AS json),
(SELECT '{"json":{"B":"1","C":"1"}}' AS json),
(SELECT '{"A":"1"}' AS json),
(SELECT '{"B":"1"}' AS json),
(SELECT '{"B":"1","C":"1"}' AS json)
),
json, // Input columns
"[{name: 'parent', type:'string'}, // Output schema
{name: 'key', type:'string'},
{name: 'value', type:'string'}]",
"function(r, emit) { // The function
processKey(JSON.parse(r.json), '');
function processKey(node, parent) {
Object.keys(node).map(function(key) {
value = node[key].toString();
if (value !== '[object Object]') {
emit({parent:parent, key:key, value:value});
} else {
if (parent !== '' && parent.substr(parent.length-1) !== '.') {parent += '.'};
processKey(node[key], parent + key);
};
});
};
}"
)
GROUP BY key
ORDER BY cnt DESC
答案 1 :(得分:1)
如果禁用旧版SQL,则可以使用新的bigquery REGEX_EXTRACT_ALL函数,该函数看起来正是您正在寻找的内容:https://cloud.google.com/bigquery/sql-reference/functions-and-operators#regexp_extract_all