此前提出的类似问题:
计算单个项目的项目:jq count the number of items in json by a specific key
计算对象值的总和: How do I sum the values in an array of maps in jq?
如何模拟COUNT聚合函数,该函数应该与其SQL原始函数类似?让我们进一步扩展这个问题,以包含其他常规SQL函数:
最后一个不是标准的SQL函数 - 它来自PostgreSQL但非常有用。
输入时会出现一组有效的JSON对象。为了示范,我们选择一个关于业主及其宠物的简单故事。
基本关系:所有者
id name age
1 Adams 25
2 Baker 55
3 Clark 40
4 Davis 31
基本关系:宠物
id name litter owner_id
10 Bella 4 1
20 Lucy 2 1
30 Daisy 3 2
40 Molly 4 3
50 Lola 2 4
60 Sadie 4 4
70 Luna 3 4
从上面我们得到一个衍生关系 Owner_Pet (上述关系的SQL JOIN的结果)以JSON格式呈现给我们的jq查询(源数据):
{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 10, "pet": "Bella", "litter": 4 }
{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 20, "pet": "Lucy", "litter": 2 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pet_id": 30, "pet": "Daisy", "litter": 3 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pet_id": 40, "pet": "Molly", "litter": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 50, "pet": "Lola", "litter": 2 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 60, "pet": "Sadie", "litter": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 70, "pet": "Luna", "litter": 3 }
以下是示例请求及其预期输出:
{ "owner_id": 1, "owner": "Adams", "age": 25, "pets_count": 2 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pets_count": 1 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pets_count": 1 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pets_count": 3 }
{ "owner_id": 1, "owner": "Adams", "age": 25, "litter_total": 6, "litter_max": 4 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "litter_total": 3, "litter_max": 3 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "litter_total": 4, "litter_max": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "litter_total": 9, "litter_max": 4 }
{ "owner_id": 1, "owner": "Adams", "age": 25, "pets": [ "Bella", "Lucy" ] }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pets": [ "Daisy" ] }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pets": [ "Molly" ] }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pets": [ "Lola", "Sadie", "Luna" ] }
答案 0 :(得分:2)
扩展 jq
解决方案:
自定义 count()
功能:
jq -sc 'def count($k): group_by(.[$k])[] | length as $l | .[0]
| .pets_count = $l
| del(.pet_id, .pet, .litter);
count("owner_id")' source.data
输出:
{"owner_id":1,"owner":"Adams","age":25,"pets_count":2}
{"owner_id":2,"owner":"Baker","age":55,"pets_count":1}
{"owner_id":3,"owner":"Clark","age":40,"pets_count":1}
{"owner_id":4,"owner":"Davis","age":31,"pets_count":3}
自定义 sum()
功能:
jq -sc 'def sum($k): group_by(.[$k])[] | map(.litter) as $litters | .[0]
| . + {litter_total: $litters | add, litter_max: $litters | max}
| del(.pet_id, .pet, .litter);
sum("owner_id")' source.data
输出:
{"owner_id":1,"owner":"Adams","age":25,"litter_total":6,"litter_max":4}
{"owner_id":2,"owner":"Baker","age":55,"litter_total":3,"litter_max":3}
{"owner_id":3,"owner":"Clark","age":40,"litter_total":4,"litter_max":4}
{"owner_id":4,"owner":"Davis","age":31,"litter_total":9,"litter_max":4}
自定义 array_agg()
功能:
jq -sc 'def array_agg($k): group_by(.[$k])[] | map(.pet) as $pets | .[0]
| .pets = $pets | del(.pet_id, .pet, .litter);
array_agg("owner_id")' source.data
输出:
{"owner_id":1,"owner":"Adams","age":25,"pets":["Bella","Lucy"]}
{"owner_id":2,"owner":"Baker","age":55,"pets":["Daisy"]}
{"owner_id":3,"owner":"Clark","age":40,"pets":["Molly"]}
{"owner_id":4,"owner":"Davis","age":31,"pets":["Lola","Sadie","Luna"]}
答案 1 :(得分:1)
这是一个很好的练习,但是SO不是编程服务,因此我将重点关注jq中通用解决方案的一些关键概念,即使对于非常大的集合也是如此。
这里提高效率的关键是避免使用内置group_by
,因为它需要排序。由于jq基本上是面向流的,因此GROUPS_BY
的以下定义同样是面向流的。它利用了基于密钥的查找的效率,同时避免在字符串上调用tojson
:
# emit a stream of the groups defined by f
def GROUPS_BY(stream; f):
reduce stream as $x ({};
($x|f) as $s
| ($s|type) as $t
| (if $t == "string" then $s else ($s|tojson) end) as $y
| .[$t][$y] += [$x] )
| .[][] ;
distinct
和count_distinct
# Emit an array of the distinct entities in `stream`, without sorting
def distinct(stream):
reduce stream as $x ({};
($x|type) as $t
| (if $t == "string" then $x else ($x|tojson) end) as $y
| if (.[$t] | has($y)) then . else .[$t][$y] += [$x] end )
| [.[][]] | add ;
# Emit the number of distinct items in the given stream
def count_distinct(stream):
def sum(s): reduce s as $x (0;.+$x);
reduce stream as $x ({};
($x|type) as $t
| (if $t == "string" then $x else ($x|tojson) end) as $y
| .[$t][$y] = 1 )
| sum( .[][] ) ;
def owner: {owner_id,owner,age};
GROUPS_BY(inputs; .owner_id)
| (.[0] | owner) + {pets_count: count_distinct(.[]|.pet_id)}
调用:jq -nc -f program1.jq input.json
输出:
{"owner_id":1,"owner":"Adams","age":25,"pets_count":2}
{"owner_id":2,"owner":"Baker","age":55,"pets_count":1}
{"owner_id":3,"owner":"Clark","age":40,"pets_count":1}
{"owner_id":4,"owner":"Davis","age":31,"pets_count":3}
GROUPS_BY(inputs; .owner_id)
| (.[0] | owner)
+ {litter_total: (map(.litter) | add)}
+ {litter_max: (map(.litter) | max)}
调用:jq -nc -f program2.jq input.json
输出:给定。
GROUPS_BY(inputs; .owner_id)
| (.[0] | owner) + {pets: distinct(.[]|.pet)}
调用:jq -nc -f program3.jq input.json
输出:
{"owner_id":1,"owner":"Adams","age":25,"pets":["Bella","Lucy"]}
{"owner_id":2,"owner":"Baker","age":55,"pets":["Daisy"]}
{"owner_id":3,"owner":"Clark","age":40,"pets":["Molly"]}
{"owner_id":4,"owner":"Davis","age":31,"pets":["Lola","Sadie","Luna"]}
答案 2 :(得分:1)
这是替代方案,不对基本JQ使用任何自定义函数。 (我有幸摆脱了问题中多余的部分)
计数
In> jq -s 'group_by(.owner_id) | map({ owner_id: .[0].owner_id, count: map(.pet) | length})'
Out>[{"owner_id": "1","pets_count": 2}, ...]
总和
In> jq -s 'group_by(.owner_id) | map({owner_id: .[0].owner_id, sum: map(.litter) | add})'
Out> [{"owner_id": "1","sum": 6}, ...]
最大
In> jq -s 'group_by(.owner_id) | map({owner_id: .[0].owner_id, max: map(.litter) | max})'
Out> [{"owner_id": "1","max": 4}, ...]
汇总
In> jq -s 'group_by(.owner_id) | map({owner_id: .[0].owner_id, agg: map(.pet) })'
Out> [{"owner_id": "1","agg": ["Bella","Lucy"]}, ...]
当然,这些可能不是最有效的实现,但它们很好地展示了如何自行实现自定义功能。不同功能之间的所有更改都位于最后一个map
内部和管道|
(length
,add
,max
)之后的功能
第一个映射遍历不同的组,从第一个项目取名称,然后再次使用map遍历相同组的项目。不像SQL一样漂亮,但并不复杂。
我今天学习了JQ,并且已经设法做到这一点,所以这对于任何入门的人来说都应该是鼓舞人心的。 JQ既不像sed也不像SQL,但也不是很难。