Aggregation on a JSON column

Time: 2015-12-18 13:17:03

Tags: json google-bigquery

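My table has a string column that holds a JSON collection of objects. Let's say the objects are words.

I want an aggregation that selects the most popular words (like the map-reduce word-count example). The data is not stored as nested records in BigQuery. I know I need to use JSON_EXTRACT.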

For example:

user  words
123   {"totalItems":2,"items":[{"word":"drink"},{"word":"food"}]}
456   {"totalItems":3,"items":[{"word":"food"},{"word":"dog"},{"word":"drink"}]}
123   {"totalItems":1,"items":[{"word":"drink"}]}

The result should be:

3 drink
2 food
1 dog

If I group by user, it would be:

userid  count  word
123     2      drink
123     1      food
456     1      food
... and so on.

Thanks in advance.

2 answers:

Answer 0: (score: 2)

By user and word

SELECT id, word, COUNT(1) AS cnt FROM (
  -- Pull the word out of each fragment, e.g. {"word":"drink"}
  SELECT id, REGEXP_EXTRACT(item, r':"(\w+)"') AS word
  FROM (
    -- SPLIT cuts the extracted array string on commas, one fragment per object
    SELECT id, SPLIT(JSON_EXTRACT(items, '$.items')) AS item
    FROM
    (SELECT 123 AS id, '{"totalItems":2,"items":[{"word":"drink"},{"word":"food"}]}' AS items),
    (SELECT 456 AS id, '{"totalItems":3,"items":[{"word":"food"},{"word":"dog"},{"word":"drink"}]}' AS items),
    (SELECT 123 AS id, '{"totalItems":1,"items":[{"word":"drink"}]}' AS items)
  )
)
GROUP BY id, word

By word

SELECT word, COUNT(1) AS cnt FROM (
  -- Same extraction as in the per-user query, without the id column
  SELECT REGEXP_EXTRACT(item, r':"(\w+)"') AS word
  FROM (
    SELECT SPLIT(JSON_EXTRACT(items, '$.items')) AS item
    FROM
    (SELECT 123 AS id, '{"totalItems":2,"items":[{"word":"drink"},{"word":"food"}]}' AS items),
    (SELECT 456 AS id, '{"totalItems":3,"items":[{"word":"food"},{"word":"dog"},{"word":"drink"}]}' AS items),
    (SELECT 123 AS id, '{"totalItems":1,"items":[{"word":"drink"}]}' AS items)
  )
)
GROUP BY word

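Answer 1: (score: 1)

Mikhail's answer is good! Note that it has to jump through some hoops with SPLIT and REGEXP_EXTRACT, because the JSON_EXTRACT function doesn't handle arrays well.

An alternative, if you want to use a BigQuery JavaScript UDF: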
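To make the SPLIT plus REGEXP_EXTRACT step concrete, here is a rough sketch in plain JavaScript of what that pipeline does to the array string. It only imitates the SQL functions for illustration and is not BigQuery code; the function name `splitAndExtract` and the extra `lang` key in the second call are made up for this example.

```javascript
// Rough imitation of SPLIT(...) followed by REGEXP_EXTRACT(item, r':"(\w+)"').
// arrayString stands in for what JSON_EXTRACT(items, '$.items') returns.
function splitAndExtract(arrayString) {
  // SPLIT with the default delimiter cuts on every comma.
  var pieces = arrayString.split(',');
  // REGEXP_EXTRACT keeps the first capture group, or NULL when nothing matches.
  return pieces.map(function (piece) {
    var m = piece.match(/:"(\w+)"/);
    return m ? m[1] : null;
  });
}

// Works when every object has exactly one key.
console.log(splitAndExtract('[{"word":"drink"},{"word":"food"}]'));
// [ 'drink', 'food' ]

// Hypothetical object with a second key: the comma inside the object
// is also split on, so "en" gets counted as if it were a word.
console.log(splitAndExtract('[{"word":"drink","lang":"en"}]'));
// [ 'drink', 'en' ]
```

So the approach fits the sample data in the question, but it would likely need adjusting if the objects carried more than one field.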

SELECT userid, word, COUNT(*) c
FROM (
  SELECT * FROM
  js(
    // I wish you had given me a sample table instead when asking the question
    (SELECT * FROM 
      (SELECT 123 AS id, '{"totalItems":2,"items":[{"word":"drink"},{"word":"food"}]}' AS items), 
      (SELECT 456 AS id, '{"totalItems":3,"items":[{"word":"food"},{"word":"dog"},{"word":"drink"}]}' AS items), 
      (SELECT 123 AS id, '{"totalItems":1,"items":[{"word":"drink"}]}' AS items) 
    ),
    // Input columns.
    id, items,
    // Output schema.
    "[{name: 'word', type:'string'},
     {name: 'userid', type:'integer'}]",
    // The function.
    "function(r, emit) {
      // Parse the JSON string and emit one output row per word.
      var x = JSON.parse(r.items);
      x.items.forEach(function(entry) {
        emit({word: entry.word, userid: r.id});
      });
    }"
  )
)
GROUP BY 1,2
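
If you want to check the UDF logic outside BigQuery, the sketch below runs the same row function in plain JavaScript (for example under Node). The names `extractWords` and `countByUserAndWord` and the `emit` stand-in are assumptions for this example only; in BigQuery the `emit` callback is supplied by the service and the counting is done by the outer GROUP BY.

```javascript
// Same logic as the function passed to js(): one emitted record per word.
function extractWords(r, emit) {
  var x = JSON.parse(r.items);
  x.items.forEach(function (entry) {
    emit({ word: entry.word, userid: r.id });
  });
}

// Sample rows from the question.
var rows = [
  { id: 123, items: '{"totalItems":2,"items":[{"word":"drink"},{"word":"food"}]}' },
  { id: 456, items: '{"totalItems":3,"items":[{"word":"food"},{"word":"dog"},{"word":"drink"}]}' },
  { id: 123, items: '{"totalItems":1,"items":[{"word":"drink"}]}' }
];

// Hypothetical local harness standing in for COUNT(*) ... GROUP BY 1,2:
// it collects emitted records and counts them per "userid word" key.
function countByUserAndWord(inputRows) {
  var counts = {};
  inputRows.forEach(function (r) {
    extractWords(r, function (out) {
      var key = out.userid + ' ' + out.word;
      counts[key] = (counts[key] || 0) + 1;
    });
  });
  return counts;
}

console.log(countByUserAndWord(rows));
// { '123 drink': 2, '123 food': 1, '456 food': 1, '456 dog': 1, '456 drink': 1 }
```

Because the function uses JSON.parse rather than string splitting, it should keep working if the objects gain additional fields.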