Question

I have an XML file similar to the following:

<?xml version="1.0" encoding="UTF-8"?>
<OnlineCommentary>
    <doc docid="cnn_210085_comment002" articleURL="http://www.cnn.com/News.asp?NewsID=210085" date="10/07/2010" time="00:21" subtitle="Is Justin Bieber getting special treatment?" author="Zorro75">
        <seg id="1"> They are the same thing. Let's shoot them both. </seg>
    </doc>
    <doc docid="cnn_210092_comment004" articleURL="http://www.cnn.com/News.asp?NewsID=210092" date="06/04/2010" time="17:07" subtitle="Dear Chicago, we love you despite it all" author="MRL1313">
        <seg id="1"> We can't wait for you to move back either. </seg>
        <seg id="2"> You seem quite uptight. </seg>
        <seg id="3"> Does your wife (who is also your sister) not give it up any more? </seg>
    </doc>
</OnlineCommentary>

I want to run a command on this file that extracts only the text between the opening tag <seg ...> and the closing tag </seg>.

I tried:

sed -n 's:.*<seg id="1">\(.*\)</seg>.*:\1:p' XML-file.xml > output.txt

My questions are:

- How can I print the content of every <seg id="*">? My command only prints the content of the first tag (<seg id="1">).

- Is there a way to print <seg id="1">, <seg id="2">, <seg id="3"> from the same doc on one line, while a doc that contains only <seg id="1"> is printed on a line of its own?

Answer 1

Use a proper XML processing tool. For example, in XML::XSH2:

open file.xml ;
for //doc echo seg/text() ;

Answer 2

To print every <seg id=> element (one per line), including the <seg ...> tags themselves:

sed -n 's:.*\(<seg id="[0-9]\{1,\}">.*</seg>\).*:\1:p' XML-file.xml > output.txt
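If only the text is wanted, without the surrounding tags, the capture group can be moved inside the tags. The following is a minimal sketch, not part of the original answer: the function name seg_text is made up for illustration, and it assumes each <seg> element opens and closes on a single line with a numeric id, as in the sample file.

```shell
# Hypothetical helper: print the text of every <seg id="N"> element,
# one per line, without the tags. Assumes one <seg>...</seg> per line.
seg_text() {
  sed -n '
    # Select lines holding a complete seg element; capture the inner text.
    \:.*<seg id="[0-9]\{1,\}">\(.*\)</seg>.*: {
      # The empty regex reuses the address above; keep only the capture.
      s//\1/
      # Trim the padding spaces around the text.
      s/^[[:space:]]*//
      s/[[:space:]]*$//
      p
    }
  ' "$1"
}
```

Called as seg_text XML-file.xml > output.txt, this should write one comment sentence per line. Because the id is matched with [0-9]\{1,\} rather than a literal 1, every segment is printed, not only the first one in each doc.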

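To print everything on a single line, separated by commas: append each match to the hold buffer instead of printing it, then at the end of the file recall the buffer, replace the newlines with commas (removing the leading comma left by the append action), and print the result: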

sed -n '\:.*\(<seg id="[0-9]\{1,\}">.*</seg>\).*:  { s//\1/
   H
   }
$ {g
   s/\n/,/g;s/^,//
   p
   }' XML-file.xml > output.txt
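The script above joins every segment in the file into one line. The question asks for one line per doc instead, which can be approximated by flushing the hold buffer whenever a closing </doc> tag is seen. This is a sketch under assumptions, not the answerer's code: the name segs_per_doc is hypothetical, a space is chosen as the separator, tags are stripped, and each <seg> element and each </doc> tag is assumed to sit on its own line.

```shell
# Hypothetical helper: print the seg texts of each <doc> on one line.
# Assumes one <seg>...</seg> per line and </doc> on a line of its own.
segs_per_doc() {
  sed -n '
    # Seg line: keep the trimmed inner text and append it to the hold buffer.
    \:.*<seg id="[0-9]\{1,\}">\(.*\)</seg>.*: {
      s//\1/
      s/^[[:space:]]*//
      s/[[:space:]]*$//
      H
    }
    # End of a doc: swap the collected text in, join it, print, then reset.
    \:</doc>: {
      x
      s/^\n//
      s/\n/ /g
      # Print only if the doc had at least one seg.
      /./p
      # Empty the pattern space and swap it back so the hold buffer is clear.
      s/.*//
      x
    }
  ' "$1"
}
```

Running segs_per_doc XML-file.xml > output.txt should give one output line per doc, so a doc with a single segment ends up alone on its line. Changing the replacement in s/\n/ /g to a comma would reproduce the comma-separated style used above.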

That said, @choroba's suggestion to use a proper XML tool is a very good one: it minimizes the risk of picking up unwanted data from the file.

Extracting text between two XML tags using sed
