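This is sample output from git log, formatted as JSON.
The problem is that the value of the body key sometimes contains line breaks, which makes the JSON file impossible to parse unless it is corrected.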
# start of cross-section
[{
"commit-hash": "11d07df4ce627d98bd30eb1e37c27ac9515c75ff",
"abbreviated-commit-hash": "11d07df",
"author-name": "Robert Lucian CHIRIAC",
"author-email": "robert.lucian.chiriac@gmail.com",
"author-date": "Sat, 27 Jan 2018 22:33:37 +0200",
"subject": "@fix(automation): patch versions aren't released",
"sanitized-subject-line": "fix-automation-patch-versions-aren-t-released",
"body": "Nothing else to add.
Fixes #24.",
"commit-notes": ""
},
# end of cross-section
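I have browsed the man page for sed, but the explanations are hard to digest. Does anyone have a suggestion for how to put the value of body on a single line and so get rid of all these line breaks? The idea is to make the file valid so that it can be parsed.
In the end, it should look like this: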
...
"body": "Nothing else to add. Fixes #24."
...
Answer 0 (score: 2)
You can try this, but double quotes inside a string value may break it:
awk -F'"' '
NF == 4 {in_string = 1}
in_string && NF == 2 {in_string = 0}
{printf "%s%s", $0, in_string ? " " : ORS}
' file.json
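The NF tests in that script rely on a simple count: with `"` as the field separator, a non-empty line containing n double quotes splits into n + 1 fields. A minimal sketch in Python of that counting, where `str.split` stands in for awk's field splitting and the helper name `quote_fields` is made up for illustration:

```python
def quote_fields(line: str) -> int:
    """Return what awk -F'"' would report as NF for a non-empty line."""
    return len(line.split('"'))

# A complete key/value pair has four quotes, hence five fields.
complete = '"commit-notes": "",'
# A line that opens a string without closing it has three quotes: NF == 4.
opening = '"body": "Nothing else to add.'
# The line that finally closes the string has a single quote: NF == 2.
closing = 'Fixes #24.",'
# An escaped quote is counted like any other quote, so this opening line
# no longer gives NF == 4 and the script misses it.
escaped = '"baz":"a string with \\"escaped'

for sample in (complete, opening, closing, escaped):
    print(quote_fields(sample), sample)
```

The last sample is the case the warning above refers to: the escaped quote shifts the count, so the line is not recognised as the start of an unterminated string.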
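To handle the embedded-quote problem, let's replace every escaped quote with some other text, deal with the newlines, and then restore the escaped quotes: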
awk -F'"' -v escaped_quote_marker='!@_Q_@!' '
{gsub(/\\\"/, escaped_quote_marker)}
NF == 4 {in_string = 1}
in_string && NF == 2 {in_string = 0}
{
gsub(escaped_quote_marker, "\\\"")
printf "%s%s", $0, in_string ? " " : ORS
}
' <<END
[{
"foo":"bar",
"baz":"a string with \"escaped
quotes\" and \"newlines\"
."
}]
END
[{
"foo":"bar",
"baz":"a string with \"escaped quotes\" and \"newlines\" ."
}]
I assume git log is at least good enough to escape the quotes for you.
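For readers who prefer to follow the logic outside awk, here is a rough Python port of the same idea. It is only a sketch: the function name `join_broken_strings` is invented, the marker is assumed never to occur in the input, and, like the awk version, it does not handle a literal backslash directly before a closing quote.

```python
import json

MARKER = "!@_Q_@!"  # assumption: this text never occurs in the input

def join_broken_strings(text: str) -> str:
    out = []
    in_string = False
    for line in text.splitlines():
        # Hide escaped quotes so that only real string delimiters are counted.
        quotes = line.replace('\\"', MARKER).count('"')
        if quotes == 3:                  # same as NF == 4: a string is left open
            in_string = True
        elif in_string and quotes == 1:  # same as NF == 2: the string closes
            in_string = False
        # Inside an open string, join with a space instead of a newline.
        out.append(line + (" " if in_string else "\n"))
    return "".join(out)

broken = r'''[{
"foo":"bar",
"baz":"a string with \"escaped
quotes\" and \"newlines\"
."
}]
'''

fixed = join_broken_strings(broken)
print(fixed)
print(json.loads(fixed))  # parses once the raw newlines are gone
```

Because the marker is only used for counting, the original line is never modified and nothing has to be restored afterwards.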
Answer 1 (score: 2)
With GNU awk, for multi-character RS and patsplit(), this will work whether or not the input contains escaped quotes:
$ cat tst.awk
BEGIN { RS="^$"; ORS="" }
{
gsub(/@/,"@A")
gsub(/\\"/,"@B")
nf = patsplit($0,flds,/"[^"]*"/,seps)
$0 = ""
for (i=0; i<=nf; i++) {
$0 = $0 gensub(/\s*\n\s*/," ","g",flds[i]) seps[i]
}
gsub(/@B/,"\\\"")
gsub(/@A/,"@")
print
}
$ awk -f tst.awk file
# start of cross-section
[{
"commit-hash": "11d07df4ce627d98bd30eb1e37c27ac9515c75ff",
"abbreviated-commit-hash": "11d07df",
"author-name": "Robert Lucian CHIRIAC",
"author-email": "robert.lucian.chiriac@gmail.com",
"author-date": "Sat, 27 Jan 2018 22:33:37 +0200",
"subject": "@fix(automation): patch versions aren't released",
"sanitized-subject-line": "fix-automation-patch-versions-aren-t-released",
"body": "Nothing else to add. Fixes #24.",
"commit-notes": ""
},
# end of cross-section
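It replaces every escaped quote with a string that cannot occur in the input (the first gsub() guarantees that), then operates on the "..." strings, and finally puts the escaped quotes back.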
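The same four steps can be sketched in Python. This is an illustrative translation rather than the answer's code: the function name `join_lines_in_strings` is hypothetical and `re.sub` with a callback takes the place of patsplit() and gensub().

```python
import re

def join_lines_in_strings(text: str) -> str:
    # Step 1: rewrite every "@" as "@A", so "@B" cannot occur in the text.
    text = text.replace("@", "@A")
    # Step 2: hide escaped quotes behind the now-unused "@B".
    text = text.replace('\\"', "@B")
    # Step 3: inside each "..." string, collapse a newline and the
    # whitespace around it into a single space.
    text = re.sub(
        r'"[^"]*"',
        lambda m: re.sub(r"\s*\n\s*", " ", m.group(0)),
        text,
    )
    # Step 4: undo the substitutions in reverse order.
    return text.replace("@B", '\\"').replace("@A", "@")

sample = (
    '"subject": "@fix: say \\"hi\\"",\n'
    '"body": "Nothing else to add.\nFixes #24.",\n'
)
print(join_lines_in_strings(sample))
```

Text that already contains "@B" survives the round trip, because step 1 turns it into "@AB" before step 2 runs.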
Answer 2 (score: 1)
sed cannot easily handle multi-line input. You can use perl in slurp mode instead:
perl -0777 -pe 's~("body":\h*"|\G(?<!^))([^\n"]*)\n+~$1$2 ~g' file
# start of cross-section
[{
"commit-hash": "11d07df4ce627d98bd30eb1e37c27ac9515c75ff",
"abbreviated-commit-hash": "11d07df",
"author-name": "Robert Lucian CHIRIAC",
"author-email": "robert.lucian.chiriac@gmail.com",
"author-date": "Sat, 27 Jan 2018 22:33:37 +0200",
"subject": "@fix(automation): patch versions aren't released",
"sanitized-subject-line": "fix-automation-patch-versions-aren-t-released",
"body": "Nothing else to add. Fixes #24.",
"commit-notes": ""
},
# end of cross-section
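\G asserts the position at the end of the previous match, or at the start of the string for the first match. (?<!^) is a negative lookbehind that makes sure we do not match at the start position. The ("body":\h*"|\G(?<!^)) expression therefore matches either "body": followed by the opening quote, or the end of the previous match.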
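Python's re module has no \G anchor, so a Python version needs a different technique. A minimal sketch, assuming the key is always spelled "body": match the whole value in one go and replace the newlines in a callback. The names `BODY` and `join_body_lines` are invented for this example, and unlike the perl one-liner the pattern also steps over escaped quotes inside the value.

```python
import re

# "body": plus its quoted value; (?:[^"\\]|\\.)* consumes ordinary
# characters and backslash escapes up to the closing quote.
BODY = re.compile(r'("body":[ \t]*")((?:[^"\\]|\\.)*)"', re.DOTALL)

def join_body_lines(text: str) -> str:
    def fix(match: re.Match) -> str:
        # Replace each run of newlines inside the value with one space.
        return match.group(1) + re.sub(r"\n+", " ", match.group(2)) + '"'
    return BODY.sub(fix, text)

sample = (
    '"subject": "x",\n'
    '"body": "Nothing else to add.\nFixes #24.",\n'
    '"commit-notes": ""\n'
)
print(join_body_lines(sample))
```

Only the body value is touched, so line breaks between the other keys are left as they are.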