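Joining multiple lines with sed based on a pattern

Time: 2018-01-30 21:12:14

Tags: json bash git awk sed

This is a sample of git log output formatted as JSON. The problem is that the value of the body key sometimes contains line breaks, which makes the JSON file impossible to parse until they are fixed.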

# start of cross-section
[{
  "commit-hash": "11d07df4ce627d98bd30eb1e37c27ac9515c75ff",
  "abbreviated-commit-hash": "11d07df",
  "author-name": "Robert Lucian CHIRIAC",
  "author-email": "robert.lucian.chiriac@gmail.com",
  "author-date": "Sat, 27 Jan 2018 22:33:37 +0200",
  "subject": "@fix(automation): patch versions aren't released",
  "sanitized-subject-line": "fix-automation-patch-versions-aren-t-released",
  "body": "Nothing else to add.

Fixes #24.",
 "commit-notes": ""
},
# end of cross-section

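I have gone through the sed man page, but the explanations are hard to digest. Does anyone have suggestions on how to put the value of body on a single line and so get rid of all these line breaks? The idea is to make the file valid so that it can be parsed.

In the end, it should look like this: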

...
"body": "Nothing else to add. Fixes #24."
...

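3 Answers:

Answer 0 (score: 2)

You can try this, but double quotes inside the string values may break it:

  • Using the double quote as the field separator, we count the number of fields on each line.
  • We expect 5 fields.
  • If there are 4, we have an "open" string.
  • If we are inside an open string and we see 2 fields, that line contains the closing double quote.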
awk -F'"' '
    NF == 4              {in_string = 1} 
    in_string && NF == 2 {in_string = 0} 
    {printf "%s%s", $0, in_string ? " " : ORS}
' file.json
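
To see where the counts 5, 4 and 2 come from, here is a minimal sketch that prints NF for a few representative lines. The helper name count_fields and the sample lines are illustrative assumptions, not part of the answer.

```shell
# Hypothetical helper: print how many double-quote-separated fields each line has.
count_fields() {
    awk -F'"' '{ print NF }'
}

counts=$(printf '%s\n' \
    '  "subject": "one-line value",' \
    '  "body": "Nothing else to add.' \
    '' \
    'Fixes #24.",' | count_fields)

printf '%s\n' "$counts"
```

A complete key/value line yields 5 fields, the line that opens the string yields 4, the empty line inside the string yields 0, and the line holding the closing quote yields 2.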

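To handle the inner-quote problem, let's replace all escaped quotes with some other text, handle the newlines, and then restore the escaped quotes: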

awk -F'"' -v escaped_quote_marker='!@_Q_@!' '
    {gsub(/\\\"/, escaped_quote_marker)}
    NF == 4              {in_string = 1}
    in_string && NF == 2 {in_string = 0}
    {
        gsub(escaped_quote_marker, "\\\"")
        printf "%s%s", $0, in_string ? " " : ORS
    }
' <<END
[{
    "foo":"bar",
    "baz":"a string with \"escaped
quotes\" and \"newlines\"
."
}]
END
[{
    "foo":"bar",
    "baz":"a string with \"escaped quotes\" and \"newlines\" ."
}]

I assume git log is at least nice enough to escape the quotes for you.
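
The approach above relies on the exact field counts 5, 4 and 2. As a hedged variation on the same idea (not the answerer's code), one could instead toggle the "inside a string" state whenever a line contains an odd number of unescaped quotes. The function name join_open_strings is hypothetical, and the sketch assumes that quotes inside values are always backslash-escaped.

```shell
# Hypothetical helper: join lines that fall inside an open JSON string.
join_open_strings() {
    awk -F'"' '
    {
        line = $0                 # keep the untouched text for printing
        gsub(/\\\\/, "")          # drop escaped backslashes so they cannot hide a quote
        gsub(/\\"/, "")           # drop escaped quotes; awk re-splits $0, so NF is updated
        if (NF > 0 && NF % 2 == 0)     # even NF means an odd number of real quotes
            in_string = !in_string     # so a string was opened or closed on this line
        printf "%s%s", line, (in_string ? " " : ORS)
    }'
}

joined=$(printf '%s\n' \
    '[{' \
    '    "foo":"bar",' \
    '    "baz":"a string with \"escaped' \
    'quotes\" and \"newlines\"' \
    '."' \
    '}]' | join_open_strings)

printf '%s\n' "$joined"
```

Because only the parity of the quote count matters, this variant does not care how many key/value pairs share a line, but it still breaks if a value contains an unescaped quote.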

答案 1 :(得分:2)

Using GNU awk for multi-character RS and patsplit(), this works whether or not escaped quotes are present in the input:

$ cat tst.awk
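BEGIN { RS="^$"; ORS="" }          # GNU awk: read the whole file as one record
{
    gsub(/@/,"@A")                 # make sure "@B" cannot occur in the input
    gsub(/\\"/,"@B")               # hide escaped quotes
    # flds[] receives the quoted strings, seps[] the text between them
    nf = patsplit($0,flds,/"[^"]*"/,seps)
    $0 = ""
    for (i=0; i<=nf; i++) {
        # collapse line breaks (and surrounding whitespace) inside each string
        $0 = $0 gensub(/\s*\n\s*/," ","g",flds[i]) seps[i]
    }
    gsub(/@B/,"\\\"")              # restore escaped quotes
    gsub(/@A/,"@")                 # restore literal @ characters
    print
}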

$ awk -f tst.awk file
# start of cross-section
[{
  "commit-hash": "11d07df4ce627d98bd30eb1e37c27ac9515c75ff",
  "abbreviated-commit-hash": "11d07df",
  "author-name": "Robert Lucian CHIRIAC",
  "author-email": "robert.lucian.chiriac@gmail.com",
  "author-date": "Sat, 27 Jan 2018 22:33:37 +0200",
  "subject": "@fix(automation): patch versions aren't released",
  "sanitized-subject-line": "fix-automation-patch-versions-aren-t-released",
  "body": "Nothing else to add. Fixes #24.",
 "commit-notes": ""
},
# end of cross-section

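It replaces every escaped quote with a string that cannot exist in the input (the first gsub() ensures that), then operates on the "..." strings, and finally puts the escaped quotes back.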
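
The masking step is the subtle part, so here is a minimal sketch of it in isolation. It uses plain awk gsub() calls only; the function name mask_roundtrip and the sample text are assumptions made for illustration. The sample deliberately contains a literal "@B" (in "@Bob") to show why "@" has to be rewritten first.

```shell
# Hypothetical helper: show the masked form of a line, then the restored form.
mask_roundtrip() {
    awk '{
        gsub(/@/, "@A")        # now every @ in the text is followed by A
        gsub(/\\"/, "@B")      # so @B can only stand for an escaped quote
        masked = $0
        gsub(/@B/, "\\\"")     # restore the escaped quotes
        gsub(/@A/, "@")        # then restore the original @ characters
        print masked
        print $0
    }'
}

roundtrip=$(printf '%s\n' 'say \"hi\" to @Bob' | mask_roundtrip)
printf '%s\n' "$roundtrip"
```

The first output line is the masked text, in which no backslash-quote pair remains, and the second line should be identical to the input.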

Answer 2 (score: 1)

sed does not handle multi-line input easily. You can use perl in slurp mode:

perl -0777 -pe 's~("body":\h*"|\G(?<!^))([^\n"]*)\n+~$1$2 ~g' file

# start of cross-section
[{
  "commit-hash": "11d07df4ce627d98bd30eb1e37c27ac9515c75ff",
  "abbreviated-commit-hash": "11d07df",
  "author-name": "Robert Lucian CHIRIAC",
  "author-email": "robert.lucian.chiriac@gmail.com",
  "author-date": "Sat, 27 Jan 2018 22:33:37 +0200",
  "subject": "@fix(automation): patch versions aren't released",
  "sanitized-subject-line": "fix-automation-patch-versions-aren-t-released",
  "body": "Nothing else to add. Fixes #24.",
 "commit-notes": ""
},
# end of cross-section
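  • \G asserts the position at the end of the previous match, or at the start of the string for the first match.
  • (?<!^) is a negative lookbehind that ensures we do not match at the start position.
  • ("body":\h*"|\G(?<!^)) matches either "body": (with optional horizontal whitespace and the opening quote) or the end of the previous match.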
