I have a bash script that processes JSON Lines files, each containing a few million objects like the one below (see source):
{
  "company_number": "09626947",
  "data": {
    "address": {
      "address_line_1": "Troak Close",
      "country": "England",
      "locality": "Christchurch",
      "postal_code": "BH23 3SR",
      "premises": "9",
      "region": "Dorset"
    },
    "country_of_residence": "United Kingdom",
    "date_of_birth": {
      "month": 11,
      "year": 1979
    },
    "etag": "7123fb76e4ad7ee7542da210a368baa4c89d5a06",
    "kind": "individual-person-with-significant-control",
    "links": {
      "self": "/company/09626947/persons-with-significant-control/individual/FFeqke7T3LvGvX6xmuGqi5SJXAk"
    },
    "name": "Ms Angela Lynette Miller",
    "name_elements": {
      "forename": "Angela",
      "middle_name": "Lynette",
      "surname": "Miller",
      "title": "Ms"
    },
    "nationality": "British",
    "natures_of_control": [
      "significant-influence-or-control"
    ],
    "notified_on": "2016-06-06"
  }
}
My jq query is as follows:
for file in psc_chunk_*; do
jq --slurp --raw-output 'def pad($n): range(0;$n) as $i |
.[$i]; ([.[] | .data.natures_of_control | length] | max) as $mx |
.[] |
select(.data) |
[.company_number, .data.kind, .data.address.address_line_1, .data.address.country, .data.address.locality, .data.address.postal_code, .data.address.premises, .data.identification.country_registered, .data.identification.legal_authority, .data.identification.legal_form, .data.identification.place_registered, .data.identification.registration_number, .data.ceased_on, .data.country_of_residence, "\(.data.date_of_birth.year)-\(.data.date_of_birth.month)", .data.etag, .data.links.self, .data.name, .data.name_elements.title, .data.name_elements.forename, .data.name_elements.middle_name, .data.name_elements.surname, .data.nationality, .data.notified_on, (.data.natures_of_control | pad($mx))] |
@csv' $file > $file.csv;
done
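This will probably hurt the eyes of many jq professionals: extracting the key:value pairs one by one is inefficient, and if the provider happens to change a key name, my code will stop working.
Is there a way to flatten all of the JSON to CSV while keeping the keys as headers? An extra difficulty is that the list natures_of_control has a varying number of entries (which is why I used the pad function to get a rectangular result).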
Answer 0 (score: 1)
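Here is an approach that determines the headers programmatically. To illustrate it, we focus on a single object.
Because jq's paths builtin ignores paths to null, and because one of the requirements here is that such paths must not be ignored, we begin by defining filters analogous to paths/0 and paths/1:
# Generate a stream of all paths, including paths to null
def allpaths:
  def conditional_recurse(f): def r: ., (select(.!=null) | f | r); r;
  path(conditional_recurse(.[]?)) | select(length > 0);
def allpaths(filter):
  allpaths as $p | getpath($p) as $v | select($v | filter) | $p;
Next, we define a function for abbreviating long paths. You may wish to adapt it to your own needs.
# Input: an array denoting a path; output: a string
def abbreviate: if .[-1]|type == "number" then "\(.[-2]):\(.[-1])" else "\(.[-1])" end;
Finally, we put things together for the single-object case by producing a line of headers followed by a line of the corresponding values:
# "scalars | true" rather than plain "scalars": select treats null and false
# as falsy, so plain "scalars" would drop the paths to those values
[allpaths(scalars | true)] as $p
| ($p | map(abbreviate) | @csv),
  ([getpath($p[])] | @csv)
For the JSON object in the question, the above (run with the -r command-line option) produces two lines of CSV: the headers, in which the single array element appears as natures_of_control:0, followed by the corresponding values.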
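As a rough illustration of how the pieces of this answer fit together, the following sketch runs the single-object program on a small made-up object. The keys id, data, name and tags and their values are hypothetical and not taken from the question's data, and jq is assumed to be on the PATH. The sketch passes scalars | true rather than plain scalars to allpaths: select treats null and false as falsy, so plain scalars would drop the path to the null value.

```shell
#!/bin/sh
# Hypothetical sample object: one null value and one two-element array.
sample='{"id":"x1","data":{"name":null,"tags":["a","b"]}}'

result=$(printf '%s\n' "$sample" | jq -r '
  # Generate a stream of all paths, including paths to null
  def allpaths:
    def conditional_recurse(f): def r: ., (select(.!=null) | f | r); r;
    path(conditional_recurse(.[]?)) | select(length > 0);
  def allpaths(filter):
    allpaths as $p | getpath($p) as $v | select($v | filter) | $p;
  # Input: an array denoting a path; output: a string
  def abbreviate:
    if .[-1]|type == "number" then "\(.[-2]):\(.[-1])" else "\(.[-1])" end;

  [allpaths(scalars | true)] as $p
  | ($p | map(abbreviate) | @csv),
    ([getpath($p[])] | @csv)
')

printf '%s\n' "$result"
# Prints:
# "id","name","tags:0","tags:1"
# "x1",,"a","b"
```

The null value is kept as an empty field, and each array element gets its own column with a header of the form key:index.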
Answer 1 (score: 0)
Here is a solution that handles the arrays in the input JSON by converting them to "colon-separated values":
def atos: map(tostring) | join(":");
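As a quick check of what atos does on its own, the sketch below applies it to a small made-up array (the values are hypothetical); it assumes jq is on the PATH.

```shell
#!/bin/sh
# atos converts every element to a string and joins the results with ":".
result=$(printf '%s\n' '["a",1,true]' | jq -r '
  def atos: map(tostring) | join(":");
  atos
')

printf '%s\n' "$result"   # prints: a:1:true
```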
The same general-purpose allpaths filters shown elsewhere on this page are also used:
# Generate a stream of all paths, including paths to null
def allpaths:
def conditional_recurse(f): def r: ., (select(.!=null) | f | r); r;
path(conditional_recurse(.[]?)) | select(length > 0);
def allpaths(filter):
allpaths as $p | getpath($p) as $v | select($v | filter) | $p;
Again for the single-object case, a solution can be obtained as follows:
walk( if type == "array" then atos else . end )
| [allpaths(scalars | true)] as $p   # "| true" keeps paths to null and false, which select would otherwise drop
| ($p | map(last) | @csv),
([getpath($p[])] | @csv)
For the given input, the output would be:
"company_number","address_line_1","country","locality","postal_code","premises","region","country_of_residence","month","year","etag","kind","self","name","forename","middle_name","surname","title","nationality","natures_of_control","notified_on"
"09626947","Troak Close","England","Christchurch","BH23 3SR","9","Dorset","United Kingdom",11,1979,"7123fb76e4ad7ee7542da210a368baa4c89d5a06","individual-person-with-significant-control","/company/09626947/persons-with-significant-control/individual/FFeqke7T3LvGvX6xmuGqi5SJXAk","Ms Angela Lynette Miller","Angela","Lynette","Miller","Ms","British","significant-influence-or-control","2016-06-06"
The solution presented here is only suitable when every array in the input contains nothing but scalar values.
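To make that restriction concrete, the following sketch applies the same walk/atos step to a made-up object whose arrays are not flat arrays of scalars (the keys xs and ys are hypothetical). It assumes a jq that provides the walk builtin, such as jq 1.6 or later.

```shell
#!/bin/sh
# Hypothetical input: xs is an array of arrays, ys is an array of objects.
sample='{"xs":[[1,2],[3]],"ys":[{"a":1},{"b":2}]}'

result=$(printf '%s\n' "$sample" | jq -c '
  def atos: map(tostring) | join(":");
  walk( if type == "array" then atos else . end )
')

printf '%s\n' "$result"
# Prints: {"xs":"1:2:3","ys":"{\"a\":1}:{\"b\":2}"}
```

Nothing fails, but the nesting of xs is lost and the objects in ys end up as JSON text inside a single cell, which is rarely what one wants in a CSV column.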
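In what follows, the stream of objects is assumed to be homogeneous, in the sense that the objects share the same structure. The order of keys within a JSON object does not matter, because each row is built by looking up the paths taken from the first object.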
The definitions of allpaths and atos are as given above and are not repeated here.
# input: an object
def paths:
walk( if type == "array" then atos else . end )
| [allpaths(scalars | true)] ;   # "| true" keeps paths to null and false, which select would otherwise drop
# input: an array of paths
def headers:
map(last) | @csv ;
# input: an object
def row($paths):
walk( if type == "array" then atos else . end )
| [getpath($paths[])]
| @csv ;
The following code uses input to read the first object and inputs to read the remaining ones, so it is essential to invoke jq with the -n command-line option:
input as $first
| ($first|paths) as $paths
| ($paths | headers),
($first | row($paths)),
(inputs | row($paths))
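The multi-object program can be tried end to end with the sketch below. The two-record sample is made up (loosely modelled on the question's data), and a jq that provides the walk builtin is assumed, such as jq 1.6 or later. The sketch passes scalars | true to allpaths so that paths to null or false values in the first object would still become columns.

```shell
#!/bin/sh
# Hypothetical two-record JSON Lines sample.
sample='{"company_number":"001","data":{"kind":"a","natures_of_control":["x","y"]}}
{"company_number":"002","data":{"kind":"b","natures_of_control":["z"]}}'

result=$(printf '%s\n' "$sample" | jq -nr '
  def atos: map(tostring) | join(":");
  def allpaths:
    def conditional_recurse(f): def r: ., (select(.!=null) | f | r); r;
    path(conditional_recurse(.[]?)) | select(length > 0);
  def allpaths(filter):
    allpaths as $p | getpath($p) as $v | select($v | filter) | $p;

  # input: an object; output: an array of paths to its scalar values
  def paths:
    walk( if type == "array" then atos else . end )
    | [allpaths(scalars | true)];
  # input: an array of paths
  def headers: map(last) | @csv;
  # input: an object
  def row($paths):
    walk( if type == "array" then atos else . end )
    | [getpath($paths[])]
    | @csv;

  input as $first
  | ($first | paths) as $paths
  | ($paths | headers),
    ($first | row($paths)),
    (inputs | row($paths))
')

printf '%s\n' "$result"
# Prints:
# "company_number","kind","natures_of_control"
# "001","a","x:y"
# "002","b","z"
```

To apply this to the question's chunk files, the jq program could be saved to a file (flatten.jq is a hypothetical name) and run inside the existing loop as jq -nr -f flatten.jq "$file" > "$file.csv". Because input and inputs read one object at a time, this avoids loading a whole file into memory the way --slurp does.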