A regex to exclude the specific string '[[' via sed

Time: 2015-09-28 12:03:55

Tags: regex bash sed regex-negation

I need to use sed to get the string between '[[' and ']]' from the file response.txt:

x-content-type-options: nosniff
x-server-response-time: 63
x-dropbox-request-id: 84e52618f83eda15cb6d96eb4f601f45
pragma: no-cache
cache-control: no-cache
x-dropbox-http-protocol: None
x-frame-options: SAMEORIGIN

{"has_more": false, "cursor": "AAEynx2q5KMgkcOwL2dKZ4MCYxNTtsdA950A5kYOdjWFln_RYuAokMnJCOb85B7idOHjycS8LJye3BhWfezTkkoprVxhgMNni_Bg04A-JO9fLmqIGO3CYInBQPmNUXL57S32ECWwA-CYu1CiLi5ujTDz", "entries": [["/test", {"rev": "b1e9026cf6f4", "thumb_exists": false, "path": "/TEST", "is_dir": true, "icon": "folder", "read_only": false, "modifier": null, "bytes": 0, "modified": "Fri, 22 May 2015 05:53:27 +0000", "size": "0 bytes", "root": "dropbox", "revision": 45545}], ["/TEST/test-file-01", {"rev": "b1ed026cf6f4", "thumb_exists": false, "path": "/test/test-file-01", "is_dir": true, "icon": "folder", "read_only": false, "modifier": null, "bytes": 0, "modified": "Fri, 22 May 2015 06:15:33 +0000", "size": "0 bytes", "root": "dropbox", "revision": 45549}]], "reset": true}

I would like to use sed to extract the string, so that the result is:

[["/test", {"rev": "b1e9026cf6f4", "thumb_exists": false, "path": "/TEST", "is_dir": true, "icon": "folder", "read_only": false, "modifier": null, "bytes": 0, "modified": "Fri, 22 May 2015 05:53:27 +0000", "size": "0 bytes", "root": "dropbox", "revision": 45545}], ["/TEST/test-file-01", {"rev": "b1ed026cf6f4", "thumb_exists": false, "path": "/test/test-file-01", "is_dir": true, "icon": "folder", "read_only": false, "modifier": null, "bytes": 0, "modified": "Fri, 22 May 2015 06:15:33 +0000", "size": "0 bytes", "root": "dropbox", "revision": 45549}]]

I ran this command in the terminal:

$ sed -n 's/.*"entries": *\(\[\[.*\]\]\)/\1/p' /tmp/response.txt

and got this result:

[["/test", {"rev": "b1e9026cf6f4", "thumb_exists": false, "path": "/TEST", "is_dir": true, "icon": "folder", "read_only": false, "modifier": null, "bytes": 0, "modified": "Fri, 22 May 2015 05:53:27 +0000", "size": "0 bytes", "root": "dropbox", "revision": 45545}], ["/TEST/test-file-01", {"rev": "b1ed026cf6f4", "thumb_exists": false, "path": "/test/test-file-01", "is_dir": true, "icon": "folder", "read_only": false, "modifier": null, "bytes": 0, "modified": "Fri, 22 May 2015 06:15:33 +0000", "size": "0 bytes", "root": "dropbox", "revision": 45549}]], "reset": true}

Then I ran this command in the terminal:

$ sed -n 's/.*"entries": *\(\[\[(?!\]\].)*\]\]\)/\1/p' /tmp/response.txt

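It returns nothing.

It seems I wrote the wrong regex? What can I do? Thanks!

4 answers:

Answer 0 (score: 2)

Avoid parsing JSON with regular expressions. Use a proper parser.

If you have jq installed: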

awk -v RS="" "END {print}" response.txt | jq -c '.["entries"]'
[["/test",{"revision":45545,"root":"dropbox","size":"0 bytes","modified":"Fri, 22 May 2015 05:53:27 +0000","rev":"b1e9026cf6f4","thumb_exists":false,"path":"/TEST","is_dir":true,"icon":"folder","read_only":false,"modifier":null,"bytes":0}],["/TEST/test-file-01",{"revision":45549,"root":"dropbox","size":"0 bytes","modified":"Fri, 22 May 2015 06:15:33 +0000","rev":"b1ed026cf6f4","thumb_exists":false,"path":"/test/test-file-01","is_dir":true,"icon":"folder","read_only":false,"modifier":null,"bytes":0}]]

Or Ruby:

ruby -rjson -e '
    data = (File.readlines(ARGV.shift))[-1]
    json = JSON.parse(data)
    puts JSON.generate(json["entries"])
' response.txt
[["/test",{"rev":"b1e9026cf6f4","thumb_exists":false,"path":"/TEST","is_dir":true,"icon":"folder","read_only":false,"modifier":null,"bytes":0,"modified":"Fri, 22 May 2015 05:53:27 +0000","size":"0 bytes","root":"dropbox","revision":45545}],["/TEST/test-file-01",{"rev":"b1ed026cf6f4","thumb_exists":false,"path":"/test/test-file-01","is_dir":true,"icon":"folder","read_only":false,"modifier":null,"bytes":0,"modified":"Fri, 22 May 2015 06:15:33 +0000","size":"0 bytes","root":"dropbox","revision":45549}]]

Or any language of your choice that implements a JSON parser.

Answer 1 (score: 0)

This might work for you (GNU sed):

sed '/\n/!{s/\[\[/\n&/g;s/\]\]/&\n/g};/^\[\[/P;D' file

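If the pattern space does not contain a \n, insert a \n before every [[ string and append a \n after every ]] string. If the pattern space begins with [[, print up to the following \n (or to the end of the pattern space). Delete up to the next \n (or to the end of the pattern space) and repeat until the pattern space is empty.

N.B. This only prints strings that lie between newlines and that begin and end with the required strings ([[ and ]]).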
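To make the P/D loop described above easier to follow, here is a small sketch that runs the same GNU sed command on a short, made-up line instead of response.txt. The sample text and the variable name `out` are assumptions for illustration only.

```shell
# Hypothetical sample line with two [[...]] groups (not taken from response.txt)
sample='x [[a]] y [[b]] z'

# Same command as in the answer, reading from stdin (GNU sed assumed,
# because \n is used in the replacement text)
out=$(printf '%s\n' "$sample" | sed '/\n/!{s/\[\[/\n&/g;s/\]\]/&\n/g};/^\[\[/P;D')

# Each bracketed group is printed on its own line; the text around them is dropped
printf '%s\n' "$out"
```

On a line with several groups, every group is printed, so this approach extracts all [[...]] strings rather than only the one after "entries".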

Answer 2 (score: 0)

sed recognizes POSIX regular expressions, which do not include lookaround assertions such as (?!.
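As a rough illustration of why the lookahead attempt prints nothing, the sketch below feeds two made-up strings to the bracket part of the question's pattern. It assumes GNU sed, where an unescaped (, ? and ) in a basic regex are ordinary characters; the sample strings and variable names are hypothetical.

```shell
# The fragment taken from the question's second command
pattern='s/\[\[(?!\]\].)*\]\]/MATCH/p'

# Input that literally contains "(?!" is what this basic regex actually matches
literal=$(printf '%s\n' '[[(?!]]x]]' | sed -n "$pattern")

# Ordinary bracketed text does not match, so sed -n prints nothing
normal=$(printf '%s\n' '[["a"]]' | sed -n "$pattern")

printf 'literal: %s\nnormal: %s\n' "$literal" "$normal"
```

In other words, the pattern is not rejected as invalid; it simply looks for the characters (?! in the input, which response.txt does not contain.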

Fortunately, it is easy enough to write a regex for this simple case (though, as usual, it is not so easy to read):

sed -n 's/.*"entries": *\(\[\[\(]\?[^]]\)*]]\)/\1/p' /tmp/response.txt

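However, it was not greedy matching that caused the problem with your initial attempt. The problem is that you did not discard the rest of the line after the match. What you want is: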

sed -n 's/.*"entries": *\(\[\[\(]\?[^]]\)*]]\).*/\1/p' /tmp/response.txt
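The effect of the trailing .* can be checked on a shortened, made-up stand-in for the JSON line. This is only a sketch: the sample line and variable names are assumptions, and \? is a GNU sed extension to basic regexes.

```shell
# Hypothetical short stand-in for the JSON line in response.txt
line='{"has_more": false, "entries": [["/test", {"bytes": 0}], ["/b", {"bytes": 1}]], "reset": true}'

# Without a trailing .* the text after the match is left in place
without_tail=$(printf '%s\n' "$line" | sed -n 's/.*"entries": *\(\[\[\(]\?[^]]\)*]]\)/\1/p')

# With a trailing .* the rest of the line is part of the match and is replaced away
with_tail=$(printf '%s\n' "$line" | sed -n 's/.*"entries": *\(\[\[\(]\?[^]]\)*]]\).*/\1/p')

printf '%s\n' "$without_tail" "$with_tail"
```

The first result still ends in , "reset": true} because s/// only replaces the matched part of the line; the second result stops at the closing ]].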

Because sed uses "basic" POSIX regexes (BREs), you end up with a lot of backslashes. I tried to remove at least some of them, using the fact that ] is not special in a regex unless it is closing a character class. But in general, I think your needs would be better met by grep, which has the POSIX-standard option to use "extended" (normal) regexes (EREs), as well as an option to print only the matched string:

grep -oE '"entries": \[\[(]?[^]])*]]' /tmp/response.txt | cut -d ' ' -f2-

(The cut at the end removes the "entries": prefix.)

Explanation of the regex

The regex (in ERE form) consists of:

\[\[      match [[
(
  ]?      possibly a single ]
  [^]]    anything but a ]
)*        repeated as many times as necessary
]]        match ]]

The repeated group matches either anything other than a ], or a ] followed by something other than a ]. In effect, it is (almost) a negation of ]].

(It is not quite a negation, because it will not match a single ] at the end of the string, but that does not matter here: we insist on the closing ]], so the case where it reaches the end of the string will not happen.)
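To see what the "almost a negation of ]]" group buys you, the following sketch compares it with a plain greedy .* on a made-up line that contains two separate [[...]] groups. The sample line and variable names are assumptions; GNU sed is assumed for \?.

```shell
# Hypothetical line with two separate [[...]] groups
line='"entries": [["a"]], "other": [["b"]] end'

# Greedy .* between the brackets runs on to the last ]] on the line
greedy=$(printf '%s\n' "$line" | sed -n 's/.*"entries": *\(\[\[.*\]\]\).*/\1/p')

# The group that cannot step over ]] stops at the first closing ]]
bounded=$(printf '%s\n' "$line" | sed -n 's/.*"entries": *\(\[\[\(]\?[^]]\)*]]\).*/\1/p')

printf 'greedy:  %s\nbounded: %s\n' "$greedy" "$bounded"
```

For the response.txt shown in the question the two give the same result, since that line contains only one ]]; the difference only appears when a later ]] follows on the same line.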

Answer 3 (score: 0)

Try:

sed -n 's/.*"entries": *\(\[\[.*\]\]\).*/\1/p'

(Note the .* at the end of the pattern.)