Extracting text between two XML tags using sed

Time: 2014-09-19 09:04:32

Tags: xml regex linux shell sed

I have an XML file like the following:

<?xml version="1.0" encoding="UTF-8"?>
<OnlineCommentary>
    <doc docid="cnn_210085_comment002" articleURL="http://www.cnn.com/News.asp?NewsID=210085" date="10/07/2010" time="00:21" subtitle="Is Justin Bieber getting special treatment?" author="Zorro75">
        <seg id="1"> They are the same thing. Let's shoot them both. </seg>
    </doc>
    <doc docid="cnn_210092_comment004" articleURL="http://www.cnn.com/News.asp?NewsID=210092" date="06/04/2010" time="17:07" subtitle="Dear Chicago, we love you despite it all" author="MRL1313">
        <seg id="1"> We can't wait for you to move back either. </seg>
        <seg id="2"> You seem quite uptight. </seg>
        <seg id="3"> Does your wife (who is also your sister) not give it up any more? </seg>
    </doc>
</OnlineCommentary>

I want to run a command on this file that extracts only the text between the opening tag <seg ...> and the closing tag </seg>.

I tried:

sed -n 's:.*<seg id="1">\(.*\)</seg>.*:\1:p' XML-file.xml > output.txt

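My questions are:

- How can I print all <seg id="*"> elements? My command only prints the content of the first tag (<seg id="1">).

- Is there a way to have <seg id="1"><seg id="2"><seg id="3"> printed on the same line, while a doc whose only tag is <seg id="1"> is printed on a line of its own?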

2 answers:

Answer 0: (score: 1)

Use a proper XML processing tool. For example, in XML::XSH2:

open file.xml ;
for //doc echo seg/text() ;

Answer 1: (score: 1)

Print all <seg id=...> elements (one per line), including the <seg ...> tags themselves:

sed -n 's:.*\(<seg id="[0-9]\{1,\}">.*</seg>\).*:\1:p' XML-file.xml > output.txt
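If only the text is wanted, without the tags, the capture group in the command above can be moved inside the tags. The following is a minimal sketch, not part of the original answer: the sample file is a trimmed copy of the XML from the question, the helper name extract_seg_text is made up for illustration, and it assumes each <seg>...</seg> pair sits on a single line.

```shell
#!/bin/sh
# Hypothetical sample file: a trimmed copy of the XML from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<OnlineCommentary>
    <doc docid="a">
        <seg id="1"> They are the same thing. Let's shoot them both. </seg>
    </doc>
    <doc docid="b">
        <seg id="1"> We can't wait for you to move back either. </seg>
        <seg id="2"> You seem quite uptight. </seg>
    </doc>
</OnlineCommentary>
EOF

# extract_seg_text is an illustrative name, not from the original answer.
# It prints the text of every <seg id="N"> element, one per line, with the
# tags removed. Assumption: a <seg> never spans more than one line.
extract_seg_text() {
    sed -n 's:.*<seg id="[0-9]\{1,\}">\(.*\)</seg>.*:\1:p' "$1"
}

extract_seg_text "$sample"
```

The spaces just inside the tags are kept as they appear in the file. This still treats XML as plain text, so it would miss a <seg> that has extra attributes or is split across lines.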

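To print everything on a single line, separated by commas: append each match to the hold buffer instead of printing it. At the end, recall the buffer, replace the newlines with , (and remove the leading , left by the append action), then print the result: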

sed -n '\:.*\(<seg id="[0-9]\{1,\}">.*</seg>\).*:  { s//\1/
   H
   }
$ {g
   s/\n/,/g;s/^,//
   p
   }' XML-file.xml > output.txt
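The command above joins every match in the file into one line. The question also asks for one line per <doc>, and the same hold-buffer idea can be adapted by flushing the buffer at each </doc>. This is only a sketch under assumptions not stated in the answer: the helper name join_seg_per_doc is invented, each <seg>...</seg> pair and each </doc> is assumed to sit on its own line, and the tags are stripped so that only the text is printed.

```shell
#!/bin/sh
# Hypothetical sample file: a trimmed copy of the XML from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<OnlineCommentary>
    <doc docid="a">
        <seg id="1"> They are the same thing. Let's shoot them both. </seg>
    </doc>
    <doc docid="b">
        <seg id="1"> We can't wait for you to move back either. </seg>
        <seg id="2"> You seem quite uptight. </seg>
    </doc>
</OnlineCommentary>
EOF

# join_seg_per_doc is an illustrative name, not from the original answer.
# It prints one line per <doc>, with that doc's <seg> texts joined by commas.
join_seg_per_doc() {
    sed -n '
    # On a <seg> line: keep only the text and append it to the hold buffer.
    /<seg id="[0-9]\{1,\}">/ {
        s:.*<seg id="[0-9]\{1,\}">\(.*\)</seg>.*:\1:
        H
    }
    # On </doc>: fetch the collected texts, join them, print, then reset.
    /<\/doc>/ {
        x
        s/^\n//
        s/\n/,/g
        p
        s/.*//
        x
    }
    ' "$1"
}

join_seg_per_doc "$sample"
```

A <doc> with a single <seg> ends up alone on its line, and a <doc> with several has them on one line. A <doc> without any <seg> would produce an empty line.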

That said, @Choroba's suggestion to use a proper XML tool is a very good one: it minimizes the risk of picking up unwanted data from the file.