SQL-style GROUP BY aggregate functions in jq (COUNT, SUM, etc.)

Date: 2018-01-18 12:21:10

Tags: sql json group-by aggregate-functions jq

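Similar questions asked previously:

Counting items by a single key: jq count the number of items in json by a specific key

Summing object values: How do I sum the values in an array of maps in jq?

Question

How can the COUNT aggregate function be emulated in jq so that it behaves like its SQL original? Let's extend the question to cover other common SQL functions as well: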

  • COUNT
  • SUM / MAX / MIN / AVG
  • ARRAY_AGG

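The last one is not a standard SQL function. It comes from PostgreSQL, but it is very useful.

The input is a stream of valid JSON objects. For demonstration, let's take a simple story about owners and their pets.

Model and data

Base relation: Owner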

id name  age
 1 Adams  25
 2 Baker  55
 3 Clark  40
 4 Davis  31

Base relation: Pet

id name  litter owner_id
10 Bella      4        1
20 Lucy       2        1
30 Daisy      3        2
40 Molly      4        3
50 Lola       2        4
60 Sadie      4        4
70 Luna       3        4

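Source

From the above we get a derived relation Owner_Pet (the result of an SQL JOIN of the relations above), presented in JSON format to our jq queries (the source data):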

{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 10, "pet": "Bella", "litter": 4 }
{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 20, "pet": "Lucy",  "litter": 2 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pet_id": 30, "pet": "Daisy", "litter": 3 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pet_id": 40, "pet": "Molly", "litter": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 50, "pet": "Lola",  "litter": 2 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 60, "pet": "Sadie", "litter": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 70, "pet": "Luna",  "litter": 3 }

Here are sample requests and their expected output:

  • COUNT the number of pets per owner:
{ "owner_id": 1, "owner": "Adams", "age": 25, "pets_count": 2 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pets_count": 1 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pets_count": 1 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pets_count": 3 }
  • SUM up the litter counts per owner and get their MAX (MIN / AVG):
{ "owner_id": 1, "owner": "Adams", "age": 25, "litter_total": 6, "litter_max": 4 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "litter_total": 3, "litter_max": 3 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "litter_total": 4, "litter_max": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "litter_total": 9, "litter_max": 4 }
  • ARRAY_AGG pets per owner:
{ "owner_id": 1, "owner": "Adams", "age": 25, "pets": [ "Bella", "Lucy" ] }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pets": [ "Daisy" ] }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pets": [ "Molly" ] }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pets": [ "Lola", "Sadie", "Luna" ] }

3 Answers:

Answer 0: (score: 2)

Extended jq solution:

Custom count() function:

jq -sc 'def count($k): group_by(.[$k])[] | length as $l | .[0] 
                       | .pets_count = $l 
                       | del(.pet_id, .pet, .litter); 
        count("owner_id")' source.data

Output:

{"owner_id":1,"owner":"Adams","age":25,"pets_count":2}
{"owner_id":2,"owner":"Baker","age":55,"pets_count":1}
{"owner_id":3,"owner":"Clark","age":40,"pets_count":1}
{"owner_id":4,"owner":"Davis","age":31,"pets_count":3}

Custom sum() function:

jq -sc 'def sum($k): group_by(.[$k])[] | map(.litter) as $litters | .[0] 
                     | . + {litter_total: $litters | add, litter_max: $litters | max} 
                     | del(.pet_id, .pet, .litter); 
        sum("owner_id")' source.data

Output:

{"owner_id":1,"owner":"Adams","age":25,"litter_total":6,"litter_max":4}
{"owner_id":2,"owner":"Baker","age":55,"litter_total":3,"litter_max":3}
{"owner_id":3,"owner":"Clark","age":40,"litter_total":4,"litter_max":4}
{"owner_id":4,"owner":"Davis","age":31,"litter_total":9,"litter_max":4}

Custom array_agg() function:

jq -sc 'def array_agg($k): group_by(.[$k])[] | map(.pet) as $pets | .[0] 
                           | .pets = $pets | del(.pet_id, .pet, .litter); 
        array_agg("owner_id")' source.data

Output:

{"owner_id":1,"owner":"Adams","age":25,"pets":["Bella","Lucy"]}
{"owner_id":2,"owner":"Baker","age":55,"pets":["Daisy"]}
{"owner_id":3,"owner":"Clark","age":40,"pets":["Molly"]}
{"owner_id":4,"owner":"Davis","age":31,"pets":["Lola","Sadie","Luna"]}
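
The question also mentions MIN and AVG. A minimal sketch in the same style, assuming the jq built-ins min, add and length are enough for that; litter_stats is a made-up name, and the data is just the first three rows of the question's source:

```shell
# First three rows of the source data from the question.
data='{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 10, "pet": "Bella", "litter": 4 }
{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 20, "pet": "Lucy",  "litter": 2 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pet_id": 30, "pet": "Daisy", "litter": 3 }'

# litter_stats is an illustrative name, not a jq built-in.
result=$(printf '%s\n' "$data" | jq -sc '
  def litter_stats($k):
    group_by(.[$k])[]
    | map(.litter) as $litters
    | .[0]
    | . + { litter_min: ($litters | min),
            litter_avg: ($litters | add / length) }
    | del(.pet_id, .pet, .litter);
  litter_stats("owner_id")')
printf '%s\n' "$result"
```

AVG is simply the sum divided by the group size, so it needs no helper beyond add and length.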

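Answer 1: (score: 1)

This is a nice exercise, but SO is not a programming service, so I'll focus on some key concepts for generic solutions in jq that are efficient even for very large collections.

GROUPS_BY

The key to efficiency here is avoiding the built-in group_by, as it requires sorting. Since jq is fundamentally stream-oriented, the following definition of GROUPS_BY is likewise stream-oriented. It takes advantage of the efficiency of key-based lookups while avoiding calling tojson on strings: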

# emit a stream of the groups defined by f
def GROUPS_BY(stream; f): 
  reduce stream as $x ({};
     ($x|f) as $s
     | ($s|type) as $t
     | (if $t == "string" then $s else ($s|tojson) end) as $y
     | .[$t][$y] += [$x] )
   | .[][] ;
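
To see what the two-level lookup (type first, then key) buys, here is a small sketch on a toy stream. The values are made up for illustration; the point is that the number 1 and the string "1" end up in separate groups even though both serialize to the key "1":

```shell
result=$(jq -nc '
  def GROUPS_BY(stream; f):
    reduce stream as $x ({};
      ($x|f) as $s
      | ($s|type) as $t
      | (if $t == "string" then $s else ($s|tojson) end) as $y
      | .[$t][$y] += [$x] )
    | .[][] ;
  # Toy stream (illustrative values); the grouping key is .k
  GROUPS_BY( ({k:1,v:"x"}, {k:"1",v:"y"}, {k:1,v:"z"}); .k )
  | map(.v)')
printf '%s\n' "$result"
```

Each emitted group is an array of the original items, so any per-group aggregate (length, add, max, ...) can be piped after it.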

distinct and count_distinct

# Emit an array of the distinct entities in `stream`, without sorting
def distinct(stream): 
  reduce stream as $x ({};
      ($x|type) as $t
      | (if $t == "string" then $x else ($x|tojson) end) as $y
      # `has` raises an error on null, so fall back to {} for a type not seen yet
      | if ((.[$t] // {}) | has($y)) then . else .[$t][$y] += [$x] end )
   | [.[][]] | add ;


# Emit the number of distinct items in the given stream
def count_distinct(stream):
   def sum(s): reduce s as $x (0;.+$x);
   reduce stream as $x ({};
       ($x|type) as $t
       | (if $t == "string" then $x else ($x|tojson) end) as $y
       | .[$t][$y] = 1 )
   | sum( .[][] ) ;
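
A quick sanity check of count_distinct, sketched on made-up values: it should agree with the sort-based `[stream] | unique | length`, while never building a sorted array.

```shell
result=$(jq -nc '
  def count_distinct(stream):
    def sum(s): reduce s as $x (0; . + $x);
    reduce stream as $x ({};
      ($x|type) as $t
      | (if $t == "string" then $x else ($x|tojson) end) as $y
      | .[$t][$y] = 1 )
    | sum( .[][] ) ;
  # Toy values (illustrative): the number 1 and the string "1" count separately
  def vals: 1, "1", 1, "a", "a", null;
  [ count_distinct(vals), ([vals] | unique | length) ]')
printf '%s\n' "$result"
```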

Convenience function

def owner: {owner_id,owner,age};

Example: "COUNT the number of pets per owner"

GROUPS_BY(inputs; .owner_id)
| (.[0] | owner) + {pets_count: count_distinct(.[]|.pet_id)}

Invocation: jq -nc -f program1.jq input.json

Output:

{"owner_id":1,"owner":"Adams","age":25,"pets_count":2}
{"owner_id":2,"owner":"Baker","age":55,"pets_count":1}
{"owner_id":3,"owner":"Clark","age":40,"pets_count":1}
{"owner_id":4,"owner":"Davis","age":31,"pets_count":3}

Example: "SUM up the litter counts per owner and get their MAX"

GROUPS_BY(inputs; .owner_id)
| (.[0] | owner)
  + {litter_total: (map(.litter) | add)}
  + {litter_max:  (map(.litter) | max)}

Invocation: jq -nc -f program2.jq input.json

Output: as given in the question.

Example: "ARRAY_AGG pets per owner"

GROUPS_BY(inputs; .owner_id)
| (.[0] | owner) + {pets: distinct(.[]|.pet)}

Invocation: jq -nc -f program3.jq input.json

Output:

{"owner_id":1,"owner":"Adams","age":25,"pets":["Bella","Lucy"]}
{"owner_id":2,"owner":"Baker","age":55,"pets":["Daisy"]}
{"owner_id":3,"owner":"Clark","age":40,"pets":["Molly"]}
{"owner_id":4,"owner":"Davis","age":31,"pets":["Lola","Sadie","Luna"]}

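Answer 2: (score: 1)

Here is an alternative that uses only basic jq, without any custom functions. (I took the liberty of dropping the superfluous parts of the question.)

Count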

In> jq -s 'group_by(.owner_id) |  map({ owner_id: .[0].owner_id, count: map(.pet) | length})'
Out> [{"owner_id": 1,"count": 2}, ...]

Sum

In> jq -s 'group_by(.owner_id) | map({owner_id: .[0].owner_id, sum: map(.litter) | add})'
Out> [{"owner_id": 1,"sum": 6}, ...]

Max

In> jq -s 'group_by(.owner_id) | map({owner_id: .[0].owner_id, max: map(.litter) | max})'
Out> [{"owner_id": 1,"max": 4}, ...]

Aggregate

In> jq -s 'group_by(.owner_id) | map({owner_id: .[0].owner_id, agg: map(.pet) })'
Out> [{"owner_id": 1,"agg": ["Bella","Lucy"]}, ...]

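Of course, these may not be the most efficient implementations, but they show nicely how to implement custom functions yourself. All that changes between the different functions is inside the last map and the function after the pipe | (length, add, max).

The first map iterates over the different groups, taking the owner_id from the first item, and then map is used again to iterate over the items of the same group. Not as pretty as SQL, but not terribly complicated.

I learned jq today and already managed to do this, so it should be encouraging for anyone getting started. jq is neither like sed nor like SQL, but it is not terribly hard either.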
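
Since only the part after the pipe changes, the aggregates can also be collected in a single map. A rough sketch, assuming the first three rows of the question's source data as input; the key names count, sum, max and agg are just the ones used in this answer:

```shell
# First three rows of the source data from the question.
data='{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 10, "pet": "Bella", "litter": 4 }
{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 20, "pet": "Lucy",  "litter": 2 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pet_id": 30, "pet": "Daisy", "litter": 3 }'

# One group_by, then every aggregate computed from the same group array.
result=$(printf '%s\n' "$data" | jq -sc '
  group_by(.owner_id)
  | map({ owner_id: .[0].owner_id,
          count: length,
          sum: (map(.litter) | add),
          max: (map(.litter) | max),
          agg: map(.pet) })')
printf '%s\n' "$result"
```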