Question

I am trying to strip the <..> tags out of this small script (named test):

<chan‌ges><comment>Testing

Comment

Footer
</comment></chan‌ges>

I tried cat test | sed -e "s/<\/comment>//g; s/<comment>/ /g" > test1,

and the output is correct:

<chan‌ges> Testing

Comment

Footer
</chan‌ges>

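But when I try cat test | sed -e "s/<\/changes>//g; s/<changes>/ /g" > test1, the script is left unchanged.

I copied and pasted each command into the shell and tested it before posting it here, so I don't think a typo is the problem.

Does anyone know what kind of dark magic this is?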

Answer 1

Assuming you want to convert:

<chan‌ges><comment>Testing

Comment

Footer
</comment></chan‌ges>

to:

<chan‌ges>Testing

Comment

Footer
</chan‌ges>

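You can use (?:<(comment)>)(.*)(?:<\/\1>) and replace with \2 (the . has to be allowed to match newlines, since the comment spans several lines): https://regex101.com/r/rC1rP6/1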
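The same backreference idea can be tried directly in sed. The following is only a sketch under some assumptions: sed's basic regular expressions have no non-capturing groups, so plain \( \) groups are used (group 1 is still the tag name, group 2 the content), the :a / N / $!ba loop is a common idiom for reading the whole input into the pattern space, and the sample input is a shortened, cleaned-up stand-in for the real file.

```shell
# Shortened sample input, typed by hand so it contains no invisible characters.
input='<changes><comment>Testing
Footer
</comment></changes>'

# :a / N / $!ba appends every following line to the pattern space,
# so the substitution can see across line breaks.
result=$(printf '%s\n' "$input" |
    sed -e ':a' -e 'N' -e '$!ba' -e 's/<\(comment\)>\(.*\)<\/\1>/\2/')

printf '%s\n' "$result"
```

Because the closing tag is written as \1, the opening and closing tag names are guaranteed to agree, which the simpler per-line substitutions do not check.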

Edit: a simpler regex and a sed example:

cat test | sed 's/<\/\?comment>//g'

Replace comment with changes to match the other tag.
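Note that \? is a GNU sed extension to basic regular expressions. A possible alternative, sketched here on the assumption that your sed accepts -E (GNU and BSD sed both do), is to use extended regular expressions, where ? needs no backslash. The sample input is a shortened, hand-typed stand-in for the real file.

```shell
# Shortened sample input without any invisible characters.
input='<changes><comment>Testing
</comment></changes>'

# With -E, "\/?" means "an optional slash", so one expression
# removes both the opening and the closing comment tag.
result=$(printf '%s\n' "$input" | sed -E 's/<\/?comment>//g')

printf '%s\n' "$result"
```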

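Note: the reason your command fails is that changes in your file contains invisible Unicode characters (U+200C ZERO WIDTH NON-JOINER and U+200B ZERO WIDTH SPACE, the bytes e2 80 8c and e2 80 8b in the dump below), so the pattern <changes> never matches:

cat test | xxd shows: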

0000000: 3c63 6861 6ee2 808c e280 8b67 6573 3e3c  <chan......ges><

whereas echo '<changes>' | xxd shows:

0000000: 3c63 6861 6e67 6573 3e0a                 <changes>.
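One way to work around this, sketched below, is to delete the invisible bytes before removing the tags. The byte sequences come from the xxd dump above; the variable names and the shortened sample input are only illustrative, and LC_ALL=C is an assumption made so that sed treats the input as plain bytes.

```shell
# The two invisible characters seen in the dump, written as octal bytes.
zwnj=$(printf '\342\200\214')   # e2 80 8c, U+200C ZERO WIDTH NON-JOINER
zwsp=$(printf '\342\200\213')   # e2 80 8b, U+200B ZERO WIDTH SPACE

# Recreate a shortened version of the problem input.
input=$(printf '<chan%s%sges><comment>Testing\n</comment></chan%s%sges>' \
    "$zwnj" "$zwsp" "$zwnj" "$zwsp")

# Step 1: strip the invisible bytes.
cleaned=$(printf '%s\n' "$input" | LC_ALL=C sed -e "s/$zwnj//g" -e "s/$zwsp//g")

# Step 2: the tag patterns now match as expected.
result=$(printf '%s\n' "$cleaned" |
    sed -e 's/<\/changes>//g' -e 's/<changes>//g' \
        -e 's/<\/comment>//g' -e 's/<comment>//g')

printf '%s\n' "$result"
```

If other invisible characters are present, the same xxd check shows which bytes to add to step 1.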

Answer 2

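I think a regular expression is not the right tool for this job, because regular expressions are not very good at matching tags. I would suggest using a parser instead. Here is a Perl snippet that does what you need: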

#!/usr/bin/env perl

use strict;
use warnings;

use XML::Twig;

print XML::Twig -> parse ( \*DATA ) -> get_xpath('//*',0) -> text;

__DATA__
<changes><comment>Testing

Comment

Footer
</comment></changes>

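NB: I had to clean up your source data. When I copied and pasted it, it contained some odd characters, which may actually be the root of your problem.

This can be made into a one-liner:

perl -MXML::Twig -0777 -e 'print XML::Twig->parse(<>)->get_xpath("//*",0)->text;' your_xml_filename

(Or it accepts input on a pipe.)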

This is less suitable for more complex scenarios, but adapting it to more general tag stripping is quite simple:

E.g.:

#!/usr/bin/env perl

use strict;
use warnings;

use XML::Twig;

print map { $_ -> text } XML::Twig -> parse ( \*DATA ) -> get_xpath('//#PCDATA');

__DATA__
<changes><comment>Testing

Comment

Footer
</comment>
<anothercomment>fish here
</anothercomment>
<some_other_tag an_attribute="some_attribute">More text here</some_other_tag>
</changes>

(XML::Twig may need to be installed. That should be as simple as cpan XML::Twig, or using your package manager.)
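To see whether the module is already available before installing anything, a quick check along these lines may help. It is only a sketch: twig_status is an illustrative variable name, and the check simply asks Perl to load the module and do nothing.

```shell
# Exit status 0 means Perl could load XML::Twig.
if perl -MXML::Twig -e 1 2>/dev/null; then
    twig_status="installed"
else
    twig_status="missing"
    echo "XML::Twig not found; try: cpan XML::Twig" >&2
fi

printf 'XML::Twig is %s\n' "$twig_status"
```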
