Question

I have searched other posts on Stack Overflow but could not get my code to work, so I am posting a new question.

Below is the content of the file temp.

 <?xml version="1.0" encoding="UTF-8"?>
 <env:Envelope xmlns:env="http://schemas.xmlsoap.org/soap/envelope/"><env:Body><dp:response xmlns:dp="http://www.datapower.com/schemas/management"><dp:timestamp>2015-01-22T13:38:04Z</dp:timestamp><dp:file name="temporary://test.txt">XJzLXJlc3VsdHMtYWN0aW9uX18i</dp:file><dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:file></dp:response></env:Body></env:Envelope>

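This file contains the base64-encoded contents of two files, test.txt and test1.txt. I want to extract the base64-encoded content of each into its own file, test.txt and test1.txt respectively.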

To do this, I have to strip the XML tags surrounding the base64 content. I am trying the commands below, but they do not work as expected.

sed -n '/test.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's@<dp:file name="temporary://test.txt">@@g'|perl -p -e 's@</dp:file>@@g' > test.txt

sed -n '/test1.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's@<dp:file name="temporary://test1.txt">@@g'|perl -p -e 's@</dp:file></dp:response></env:Body></env:Envelope>@@g' > test1.txt

The command below:

sed -n '/test.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's@<dp:file name="temporary://test.txt">@@g'|perl -p -e 's@</dp:file>@@g'

produces this output:

 XJzLXJlc3VsdHMtYWN0aW9uX18i

<dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:response>   </env:Body></env:Envelope>

However, I expect only the first line, XJzLXJlc3VsdHMtYWN0aW9uX18i, in the output. Where am I going wrong?

When I run the command below, I get the expected output:

sed -n '/test1.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's@<dp:file name="temporary://test1.txt">@@g'|perl -p -e 's@</dp:file></dp:response></env:Body></env:Envelope>@@g'

It produces the following string:

lc3VsdHMtYWN0aW9uX18i

I can then easily redirect it to the file test1.txt.

Update

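I have edited the question to update the source file content. The source file does not contain any newlines between the elements. The current solution does not work in this case; I tried it and it failed. wc -l temp must output 1.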

OS: solaris 10 Shell: bash

Answer 1

sed -n 's_<dp:file name="\([^"]*\)">\([^<]*\).*_\1 -> \2_p' temp

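I added \1 -> to show the link from file name to content; if you want the content only, just remove that part.
POSIX version (so with GNU sed, use --posix).
Assumes the base64-encoded content is on the same line as the surrounding tags (and not spread over several lines, in which case some modification is needed).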
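
To make the behaviour concrete, here is a small sketch of the command run against an input that satisfies the stated assumption. The file name sample.xml and its one-element-per-line layout are invented for illustration; the output files pairs.txt and contents.txt are likewise hypothetical names.

```shell
# Hypothetical sample: each <dp:file> element sits on its own line.
cat > sample.xml <<'EOF'
<dp:response xmlns:dp="http://www.datapower.com/schemas/management">
<dp:file name="temporary://test.txt">XJzLXJlc3VsdHMtYWN0aW9uX18i</dp:file>
<dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:file>
</dp:response>
EOF

# Name and content, as in the command above.
sed -n 's_<dp:file name="\([^"]*\)">\([^<]*\).*_\1 -> \2_p' sample.xml > pairs.txt

# Content only: drop "\1 -> " from the replacement.
sed -n 's_<dp:file name="\([^"]*\)">\([^<]*\).*_\2_p' sample.xml > contents.txt
```

With this layout pairs.txt should hold one "name -> content" line per element, and contents.txt just the two base64 strings.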

Thanks to JID for the full explanation.

How it works

sed -n

-n suppresses automatic printing, so sed produces no output unless explicitly told to print.

's_

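This starts a substitution, with _ as the delimiter: whatever the following regex matches is replaced by the replacement text.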

<dp:file name=

Plain literal text.

"\([^"]*\)"

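The parentheses form a capture group; they must be escaped unless the -r option is used (-r is not available in POSIX sed). Everything inside the parentheses is captured. [^"]* means zero or more occurrences of any character that is not a double quote. So this simply captures whatever is between the two quotes.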

>\([^<]*\)

Again a capture group, this time capturing everything between > and the next <.

.*

Everything else on the line.

_\1 -> \2

This is the replacement: everything the preceding regex matched is replaced with the first capture group, then ->, then the second capture group.

_p

Means print the line.

Resources

http://unixhelp.ed.ac.uk/CGI/man-cgi?sed

http://www.grymoire.com/Unix/Sed.html

Answer 2

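/usr/xpg4/bin/sed works well here.

If the file contains only one line, /usr/bin/sed does not work properly.

The command below works for a file that contains only a single line.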

/usr/xpg4/bin/sed -n 's_<env:Envelope\(.*\)<dp:file name="temporary://BackUpDir/backupmanifest.xml">\([^>]*\)</dp:file>\(.*\)_\2_p' securebackup.xml 2>/dev/null

Without 2>/dev/null, this sed command prints the warning sed: Missing newline at end of file.
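
As a rough adaptation of the same idea to the file from the question, the pattern can be looped over both names. This is only a sketch: the SED variable and the loop are assumptions added for illustration, and the default of plain sed is there so the snippet also runs where /usr/xpg4/bin/sed does not exist. The file names must not contain the _ delimiter.

```shell
# Recreate the one-line input from the question.
printf '%s\n' '<env:Envelope xmlns:env="http://schemas.xmlsoap.org/soap/envelope/"><env:Body><dp:response xmlns:dp="http://www.datapower.com/schemas/management"><dp:timestamp>2015-01-22T13:38:04Z</dp:timestamp><dp:file name="temporary://test.txt">XJzLXJlc3VsdHMtYWN0aW9uX18i</dp:file><dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:file></dp:response></env:Body></env:Envelope>' > temp

# On Solaris 10, set SED=/usr/xpg4/bin/sed as the answer suggests.
SED=${SED:-sed}

# One pass per file: keep only the text between the matching open and close tags.
for f in test.txt test1.txt; do
  "$SED" -n 's_.*<dp:file name="temporary://'"$f"'">\([^<]*\)</dp:file>.*_\1_p' temp 2>/dev/null > "$f"
done
```

The leading and trailing .* play the role of the \(.*\) groups in the command above, without capturing text that is never used.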

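This is because of the following:

The default Solaris sed ignores a last line that lacks a trailing newline, so as not to break existing scripts, because the original Unix implementation required a newline to terminate a line.

GNU sed is more relaxed, and the POSIX implementation accepts such input but prints a warning.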
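
If the missing final newline is the only obstacle, one possible workaround is to append a newline to the stream before sed reads it. This is a sketch under that assumption and has not been checked against the Solaris /usr/bin/sed; the file names noeol.xml and content.txt are hypothetical.

```shell
# A one-line file with no trailing newline (hypothetical sample).
printf '%s' '<dp:file name="temporary://test.txt">XJzLXJlc3VsdHMtYWN0aW9uX18i</dp:file>' > noeol.xml

# Append a newline to the stream so sed sees a properly terminated last line.
{ cat noeol.xml; echo; } |
  sed -n 's_.*<dp:file name="temporary://test.txt">\([^<]*\)</dp:file>.*_\1_p' > content.txt
```

The original file is left untouched; only the data piped into sed gains the terminating newline.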

Extracting content between two different strings using bash or perl
