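Is it possible to count how many times each key appears in a JSON column?

Time: 2016-10-12 00:31:43

Tags: google-bigquery

I have a BigQuery table with a column that contains JSON.

I want to output the number of times each key appears in the column, sorted in descending order. The value associated with every key is 1.

Each object has a known/finite number of keys, but I would rather not rely on that in case the largest object seen changes.

There is a known/finite number of keys overall, but I would rather not depend on enumerating and updating a list.

E.g. input: three rows, one column named "json"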

[
  {"json": "{'A': 1}"},
  {"json": "{'B': 1}"},
  {"json": "{'B': 1, 'C': 1}"}
]

E.g. output: three rows, two columns named "key" and "count"

[
  {"key": "B", "count": 2},
  {"key": "A", "count": 1},
  {"key": "C", "count": 1}
]

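What is the simplest way to do this, given that I don't want to rely on a finite number of keys per object or overall?

2 answers:

Answer 0: (score: 3)

  

Below is for BigQuery Standard SQL.

See Enabling Standard SQL and User-Defined Functions.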

CREATE TEMPORARY FUNCTION parseJson(y STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
  // Collect every leaf key found anywhere in the parsed JSON document.
  var z = [];
  function processKey(node) {
    Object.keys(node).forEach(function(key) {
      var value = node[key];
      if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
        // Nested object: descend instead of counting the wrapper key.
        processKey(value);
      } else {
        z.push(key);
      }
    });
  }
  processKey(JSON.parse(y));
  return z;
""";

WITH theTable AS (
  SELECT '{"json":{"A":"1"}}' AS json UNION ALL 
  SELECT '{"json":{"B":"1"}}' AS json UNION ALL
  SELECT '{"json":{"B":"1","C":"1"}}' AS json
)
SELECT key, COUNT(1) AS `count`
FROM theTable, UNNEST(parseJson(json)) AS key
GROUP BY key
ORDER BY 2 DESC

Output:

key count    
B       2    
A       1    
C       1    

Note: the parseJson UDF is generic enough to handle any JSON, so you can run the code above with the input below and it will still work:

WITH theTable AS (
  SELECT '{"json":{"A":"1"}}' AS json UNION ALL 
  SELECT '{"json":{"B":"1"}}' AS json UNION ALL
  SELECT '{"json":{"B":"1","C":"1"}}' AS json UNION ALL
  SELECT '{"A":"1"}' AS json UNION ALL 
  SELECT '{"B":"1"}' AS json UNION ALL
  SELECT '{"B":"1","C":"1"}' AS json
)

Output:

key count    
B       4    
A       2    
C       2    
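
To check the traversal logic outside BigQuery, the same idea can be sketched in plain JavaScript and run under Node.js. This is only a rough local stand-in, not the query itself: the names `collectKeys` and `countKeys` are hypothetical, the nested-object test uses `typeof` (an assumption about what should count as a nested object), and the counting and sorting that SQL does with GROUP BY and ORDER BY are done by hand here.

```javascript
// Hypothetical helper: walk a parsed JSON value and collect every leaf key.
function collectKeys(node, out) {
  Object.keys(node).forEach(function (key) {
    var value = node[key];
    if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      // Nested object: descend instead of counting the wrapper key.
      collectKeys(value, out);
    } else {
      out.push(key);
    }
  });
  return out;
}

// Hypothetical helper: count keys across rows, sorted by count descending.
function countKeys(rows) {
  var counts = {};
  rows.forEach(function (row) {
    collectKeys(JSON.parse(row), []).forEach(function (key) {
      counts[key] = (counts[key] || 0) + 1;
    });
  });
  return Object.keys(counts)
    .map(function (key) { return { key: key, count: counts[key] }; })
    .sort(function (a, b) {
      // Ties are broken by key name so the output order is deterministic.
      return b.count - a.count || (a.key < b.key ? -1 : a.key > b.key ? 1 : 0);
    });
}

var rows = [
  '{"json":{"A":"1"}}',
  '{"json":{"B":"1"}}',
  '{"json":{"B":"1","C":"1"}}'
];
console.log(countKeys(rows));
// [ { key: 'B', count: 2 }, { key: 'A', count: 1 }, { key: 'C', count: 1 } ]
```

Because the wrapper key "json" holds an object, it is descended into rather than counted, which is why only A, B and C show up.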
  

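Added a version for BigQuery Legacy SQL

To simplify the presentation and further testing, I use the inline version of a Legacy SQL UDF here. The inline version is not officially supported in Legacy SQL, so if this works for you, you will need to convert it slightly. For details on UDFs in BigQuery Legacy SQL, see BigQuery User-Defined Functions.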

SELECT key, COUNT(1) as cnt
FROM JS((
  SELECT json FROM  
    (SELECT '{"json":{"A":"1"}}' AS json),
    (SELECT '{"json":{"B":"1"}}' AS json),
    (SELECT '{"json":{"B":"1","C":"1"}}' AS json),
    (SELECT '{"A":"1"}' AS json),
    (SELECT '{"B":"1"}' AS json),
    (SELECT '{"B":"1","C":"1"}' AS json)
  ),
  json,                                    // Input columns
  "[{name: 'parent', type:'string'},       // Output schema
   {name: 'key', type:'string'},
   {name: 'value', type:'string'}]",
   "function(r, emit) {                    // The function
      processKey(JSON.parse(r.json), '');
      function processKey(node, parent) {
        Object.keys(node).map(function(key) {
          value = node[key].toString();
          if (value !== '[object Object]') {
            emit({parent:parent, key:key, value:value});
          } else {
            if (parent !== '' && parent.substr(parent.length-1) !== '.') {parent += '.'};
            processKey(node[key], parent + key);
          };
        });         
      };
    }"
  )
GROUP BY key
ORDER BY cnt DESC  

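Answer 1: (score: 1)

If you disable legacy SQL (that is, use standard SQL), you can use the new BigQuery REGEXP_EXTRACT_ALL function, which looks like exactly what you are looking for: https://cloud.google.com/bigquery/sql-reference/functions-and-operators#regexp_extract_all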
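
The answer does not give a pattern, so here is a minimal sketch of one that could be tried locally in JavaScript before passing it to the function. The pattern and the name `extractKeys` are my assumptions, not part of the answer: it treats any double-quoted string followed by a colon as a key, so it assumes flat objects with double-quoted keys, and it would also pick up wrapper keys or quoted text followed by a colon inside string values.

```javascript
// Assumed pattern (not from the answer): a double-quoted string followed by a colon is a key.
var KEY_PATTERN = /"([^"]+)"\s*:/g;

// Hypothetical helper: return every captured key in order of appearance.
function extractKeys(json) {
  var keys = [];
  var match;
  KEY_PATTERN.lastIndex = 0; // reset the global regex state before each scan
  while ((match = KEY_PATTERN.exec(json)) !== null) {
    keys.push(match[1]);
  }
  return keys;
}

console.log(extractKeys('{"B": 1, "C": 1}')); // [ 'B', 'C' ]
```

In standard SQL the same pattern would presumably go in as a raw string, along the lines of REGEXP_EXTRACT_ALL(json, r'"([^"]+)"\s*:'), with the resulting array unnested and grouped the same way as in the UDF answer.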