
Question

I am trying to remove the following patterns from an XML file: the CDATA delimiters <![CDATA[ and ]]>.

For this purpose, I used the following sed command, taken from "Remove CDATA tags from XML file":

sed -e 's/<![CDATA[//g' | sed -e 's/]]>//g' file.xml

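The problem is that the patterns are never matched. The command just prints the whole text back. Here is a sample of the input: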

<text>
<![CDATA[
ethnic minority communities have been in Belfast since the 1930s.]]>
</text>

Answer 1

I suggest using the versatile XmlStarlet tool. To remove the CDATA sections and keep only the text content, use the following command:

xml fo --omit-decl --nocdata file.xml

Output:

<text>
ethnic minority communities have been in Belfast since the 1930s.
</text>

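A CDATA section is itself an escaping mechanism, so when XmlStarlet removes it, it automatically escapes any ampersand (&), which has special meaning in XML. An input document like this,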

<text>
<![CDATA[
ethnic minorities & communities have been in Belfast since the 1930s.]]>
</text>

results in this output:

<text>
ethnic minorities &amp; communities have been in Belfast since the 1930s.
</text>
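
The command above can be tried against a throwaway copy of that input. This is only a sketch under a few assumptions: the file name sample.xml and the temporary directory are made up for illustration, and the binary is looked up under both xmlstarlet and xml because the installed name differs between systems. If neither is found, the script just says so.

```shell
# Pick whichever XmlStarlet binary name is installed.
# Assumption: it is called either "xmlstarlet" or "xml" on this system.
if command -v xmlstarlet >/dev/null 2>&1; then
    xs=xmlstarlet
elif command -v xml >/dev/null 2>&1; then
    xs=xml
else
    xs=
fi

# Write the sample input to a temporary file (hypothetical name).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.xml" <<'EOF'
<text>
<![CDATA[
ethnic minorities & communities have been in Belfast since the 1930s.]]>
</text>
EOF

if [ -n "$xs" ]; then
    # Same options as in the answer: drop the XML declaration and the CDATA wrapper.
    "$xs" fo --omit-decl --nocdata "$tmpdir/sample.xml"
else
    echo "XmlStarlet not found; nothing to run" >&2
fi

# Clean up the temporary directory.
rm -r "$tmpdir"
```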

Answer 2

xml_grep --text_only 'text' input.xml > output.txt

Here, text is the name of the XML element.

Answer 3

Trying to answer the original question, since I came here and could not find an answer to it.

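You need to escape the opening square brackets in the expression; otherwise each one opens a character class. The closing brackets in the sequence that ends the CDATA section do not need escaping, because no character class is open at that point in the regular expression. You can escape them anyway, and for completeness you should, since they also carry a different meaning when unescaped.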

By the way, you can tell sed to perform several substitutions in a single expression by separating them with semicolons:

sed -e 's/<!\[CDATA\[//g; s/\]\]>//g' file.xml
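
To see the difference for yourself, here is a small sketch that runs both the unescaped and the escaped expression on the sample from the question. The file name sample.xml, the temporary directory and the variable names are assumptions made for the example; the sed expressions are the ones discussed above.

```shell
# Write the sample from the question to a temporary file (hypothetical name).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.xml" <<'EOF'
<text>
<![CDATA[
ethnic minority communities have been in Belfast since the 1930s.]]>
</text>
EOF

# Unescaped "[" starts a character class that is never closed,
# so sed rejects the expression instead of matching anything.
if sed -e 's/<![CDATA[//g' "$tmpdir/sample.xml" >/dev/null 2>&1; then
    unescaped_ok=yes
else
    unescaped_ok=no
fi
echo "unescaped expression accepted: $unescaped_ok"

# Escaped brackets are treated as literal characters; both substitutions
# run in one expression, separated by a semicolon.
result=$(sed -e 's/<!\[CDATA\[//g; s/\]\]>//g' "$tmpdir/sample.xml")
printf '%s\n' "$result"

# Clean up the temporary directory.
rm -r "$tmpdir"
```

Because sed works line by line, the line that held only the opening marker is left behind as an empty line.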