Question

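Here is sample output from git log formatted as JSON. The problem is that the value of the body key sometimes contains line breaks, which makes the file impossible to parse as JSON until they are fixed.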

# start of cross-section
[{
  "commit-hash": "11d07df4ce627d98bd30eb1e37c27ac9515c75ff",
  "abbreviated-commit-hash": "11d07df",
  "author-name": "Robert Lucian CHIRIAC",
  "author-email": "robert.lucian.chiriac@gmail.com",
  "author-date": "Sat, 27 Jan 2018 22:33:37 +0200",
  "subject": "@fix(automation): patch versions aren't released",
  "sanitized-subject-line": "fix-automation-patch-versions-aren-t-released",
  "body": "Nothing else to add.

Fixes #24.",
 "commit-notes": ""
},
# end of cross-section

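I have gone through the sed man pages, but the explanations are hard to digest. Does anyone have a suggestion for getting the value of body onto a single line, removing all of those line breaks? The goal is to make the file valid so that it can be parsed.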

In the end, it should look like this:

...
"body": "Nothing else to add. Fixes #24."
...

Answer 1

You could try this, though escaped double quotes inside the string values may break it:

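Using the double quote as the field separator, we count the number of fields on each line.
We expect a complete line to have 5 fields.
If there are 4, we have an "open" string.
If we are inside an open string and see a line with 2 fields, that line contains the closing double quote.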

awk -F'"' '
    NF == 4              {in_string = 1} 
    in_string && NF == 2 {in_string = 0} 
    {printf "%s%s", $0, in_string ? " " : ORS}
' file.json
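
To see the field counts this logic keys on, a quick check like the following may help. It is only a sketch: the sample lines are lifted from the question, and count_fields is a made-up helper name, not part of the answer.

```shell
# count_fields is a hypothetical helper: it prints the number of
# double-quote-separated fields awk sees on each input line.
count_fields() {
    awk -F'"' '{ print NF }'
}

# A complete key/value line, the line that opens the body string,
# a blank line inside the string, and the line that closes it.
printf '%s\n' \
    '  "abbreviated-commit-hash": "11d07df",' \
    '  "body": "Nothing else to add.' \
    '' \
    'Fixes #24.",' | count_fields
```

This prints 5, 4, 0 and 2, one per line. The blank line has no fields at all, so it leaves in_string untouched and is joined with a space like any other line inside the open string.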

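To handle the inner-quote problem, let's replace all escaped quotes with some other text, handle the newlines, and then restore the escaped quotes: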

awk -F'"' -v escaped_quote_marker='!@_Q_@!' '
    {gsub(/\\\"/, escaped_quote_marker)}
    NF == 4              {in_string = 1}
    in_string && NF == 2 {in_string = 0}
    {
        gsub(escaped_quote_marker, "\\\"")
        printf "%s%s", $0, in_string ? " " : ORS
    }
' <<END
[{
    "foo":"bar",
    "baz":"a string with \"escaped
quotes\" and \"newlines\"
."
}]
END

[{
    "foo":"bar",
    "baz":"a string with \"escaped quotes\" and \"newlines\" ."
}]

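I would assume git log is at least kind enough to escape the quotes for you, but that is worth checking against your actual output.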

Answer 2

With GNU awk for multi-char RS and patsplit(), this will work whether or not there are escaped quotes in the input:

$ cat tst.awk
BEGIN { RS="^$"; ORS="" }
{
    gsub(/@/,"@A")
    gsub(/\\"/,"@B")
    nf = patsplit($0,flds,/"[^"]*"/,seps)
    $0 = ""
    for (i=0; i<=nf; i++) {
        $0 = $0 gensub(/\s*\n\s*/," ","g",flds[i]) seps[i]
    }
    gsub(/@B/,"\\\"")
    gsub(/@A/,"@")
    print
}

$ awk -f tst.awk file
# start of cross-section
[{
  "commit-hash": "11d07df4ce627d98bd30eb1e37c27ac9515c75ff",
  "abbreviated-commit-hash": "11d07df",
  "author-name": "Robert Lucian CHIRIAC",
  "author-email": "robert.lucian.chiriac@gmail.com",
  "author-date": "Sat, 27 Jan 2018 22:33:37 +0200",
  "subject": "@fix(automation): patch versions aren't released",
  "sanitized-subject-line": "fix-automation-patch-versions-aren-t-released",
  "body": "Nothing else to add. Fixes #24.",
 "commit-notes": ""
},
# end of cross-section

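It replaces each escaped quote with a string that cannot occur in the input (the first gsub() guarantees that), then operates on the "..." strings, and finally puts the escaped quotes back.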
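
The placeholder trick can be tried in isolation. The sketch below uses plain sed rather than the awk script above, and protect and restore are hypothetical helper names; the sample line is made up to include a literal @B, which is the case the first substitution exists to protect.

```shell
# protect: make "@B" a safe placeholder, then hide escaped quotes behind it.
# After s/@/@A/g every "@" in the text is followed by "A", so a literal "@B"
# cannot be present when the second substitution introduces "@B".
protect() {
    sed -e 's/@/@A/g' -e 's/\\"/@B/g'
}

# restore: undo the two substitutions in reverse order.
restore() {
    sed -e 's/@B/\\"/g' -e 's/@A/@/g'
}

line='"subject": "@fix: say \"hi\" to @Bob"'
printf '%s\n' "$line" | protect
printf '%s\n' "$line" | protect | restore
```

The first command prints the protected form, "subject": "@Afix: say @Bhi@B to @ABob", in which the only double quotes left are real string delimiters. The second prints the original line unchanged. The order in restore matters: turning @A back into @ first would recreate @Bob and then corrupt it.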

Answer 3

sed cannot easily handle multiline input. You can use perl in slurp mode instead:

perl -0777 -pe 's~("body":\h*"|\G(?<!^))([^\n"]*)\n+~$1$2 ~g' file

# start of cross-section
[{
  "commit-hash": "11d07df4ce627d98bd30eb1e37c27ac9515c75ff",
  "abbreviated-commit-hash": "11d07df",
  "author-name": "Robert Lucian CHIRIAC",
  "author-email": "robert.lucian.chiriac@gmail.com",
  "author-date": "Sat, 27 Jan 2018 22:33:37 +0200",
  "subject": "@fix(automation): patch versions aren't released",
  "sanitized-subject-line": "fix-automation-patch-versions-aren-t-released",
  "body": "Nothing else to add. Fixes #24.",
 "commit-notes": ""
},
# end of cross-section

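\G asserts the position at the end of the previous match, or at the start of the string for the first match.
(?<!^) is a negative lookbehind that makes sure we don't match at the start position.
The ("body":\h*"|\G(?<!^)) expression matches either "body": followed by the opening quote, or the end of the previous match.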


Merging multiple lines with sed based on a pattern
