I have an XML file like the following:
<?xml version="1.0" encoding="UTF-8"?>
<OnlineCommentary>
<doc docid="cnn_210085_comment002" articleURL="http://www.cnn.com/News.asp?NewsID=210085" date="10/07/2010" time="00:21" subtitle="Is Justin Bieber getting special treatment?" author="Zorro75">
<seg id="1"> They are the same thing. Let's shoot them both. </seg>
</doc>
<doc docid="cnn_210092_comment004" articleURL="http://www.cnn.com/News.asp?NewsID=210092" date="06/04/2010" time="17:07" subtitle="Dear Chicago, we love you despite it all" author="MRL1313">
<seg id="1"> We can't wait for you to move back either. </seg>
<seg id="2"> You seem quite uptight. </seg>
<seg id="3"> Does your wife (who is also your sister) not give it up any more? </seg>
</doc>
</OnlineCommentary>
I want to run a command on this file that extracts only the text between the opening tag <seg ...> and the closing tag </seg>.
I tried:
sed -n 's:.*<seg id="1">\(.*\)</seg>.*:\1:p' XML-file.xml > output.txt
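My questions are:
- How do I print every <seg id="*">? My command only prints the first tag (<seg id="*">).
- Is there a way to print <seg id="1">, <seg id="2">, <seg id="3"> on the same line, while an element that contains only <seg id="1"> is printed on a line of its own?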
Answer 0: (score: 1)
Use a proper XML processing tool. For example, in XML::XSH2:
open file.xml ;
for //doc echo seg/text() ;
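The same idea can be sketched with any XML parser. The following is a minimal sketch, not part of the original answer, assuming Python 3 and its standard-library xml.etree.ElementTree. The function name segs_per_doc, the comma separator, and the trimmed sample data are illustrative assumptions. It produces one line per <doc>, with that document's <seg> texts joined together, which is the grouping asked for in the second question.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's sample (attributes and one segment omitted).
SAMPLE = """<OnlineCommentary>
<doc docid="cnn_210085_comment002">
<seg id="1"> They are the same thing. Let's shoot them both. </seg>
</doc>
<doc docid="cnn_210092_comment004">
<seg id="1"> We can't wait for you to move back either. </seg>
<seg id="2"> You seem quite uptight. </seg>
</doc>
</OnlineCommentary>"""


def segs_per_doc(root, sep=","):
    """Return one string per <doc>: its <seg> texts, stripped and joined by sep.

    Both the function name and the default separator are assumptions made
    for this sketch; they do not come from the original answer.
    """
    lines = []
    for doc in root.iter("doc"):
        texts = [(seg.text or "").strip() for seg in doc.findall("seg")]
        lines.append(sep.join(texts))
    return lines


if __name__ == "__main__":
    # For a real file, use: root = ET.parse("XML-file.xml").getroot()
    root = ET.fromstring(SAMPLE)
    for line in segs_per_doc(root):
        print(line)
```

A document with a single <seg> ends up on a line of its own, and a document with several segments gets them on one comma-separated line.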
Answer 1: (score: 1)
To print every <seg id=> (one per line), including the <seg> tags themselves:
sed -n 's:.*\(<seg id="[0-9]\{1,\}">.*</seg>\).*:\1:p' XML-file.xml > output.txt
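To print everything on one line, separated by commas: instead of printing each match, append it to the hold buffer. At the end, recall the buffer, replace every newline with , (and remove the leading , left behind by the append action), then print the result: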
sed -n '\:.*\(<seg id="[0-9]\{1,\}">.*</seg>\).*: { s//\1/
H
}
$ {g
s/\n/,/g;s/^,//
p
}' XML-file.xml > output.txt
That said, @choroba's suggestion to use a proper XML tool is a good one: it minimizes the risk of processing unwanted data from the file.
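One concrete case, offered as a sketch rather than as part of the original answer: sed works line by line, so the patterns above rely on each <seg> element sitting alone on its own line. With two elements on one line, the greedy leading .* keeps only the last one, and an element that spans several lines never matches. A parser does not depend on the layout. The example assumes Python 3's standard-library xml.etree.ElementTree; the compact input and the name seg_texts are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: same element structure as the question's file,
# but with two <seg> elements on one line and one <seg> spanning two lines.
COMPACT = (
    '<OnlineCommentary><doc docid="d1">'
    '<seg id="1"> first </seg><seg id="2"> second </seg>'
    '<seg id="3"> spans\n two lines </seg>'
    '</doc></OnlineCommentary>'
)


def seg_texts(xml_string):
    """Return the text of every <seg> element, with whitespace runs collapsed."""
    root = ET.fromstring(xml_string)
    return [" ".join((seg.text or "").split()) for seg in root.iter("seg")]


if __name__ == "__main__":
    print(seg_texts(COMPACT))
```

All three segments are returned regardless of where the line breaks fall.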