Perl regex multi-line matching in an XML file with repeated tags

Time: 2011-12-28 23:43:02

Tags: xml perl regex-greedy

Ultimately, I am trying to wrap the content of every non-empty element of an XML file in

'<![CDATA[...]]>'

Here is the sample I am testing my code against:

<consent_to_treat dsi="user 2009/06/02 10:43" version="">
<currentTime4 dsi="user 2009/06/02 10:43">10:36</currentTime4>
<todayDate dsi="user 2009/06/02 10:43">06/02/2009</todayDate>
<todayDate3 dsi="user 2009/06/02 10:43">06/02/2009</todayDate3>
<todayDate4 dsi="user 2009/06/02 10:43">06/02/2009</todayDate4>
<currentTime dsi="user 2009/06/02 10:43">10:36</currentTime>
<Relationship dsi="user 2009/06/02 10:43"></Relationship>
<PatSignatureIII dsi="user 2009/06/02 10:43"></PatSignatureIII>
<PatSignatureIV dsi="user 2009/06/02 10:43"></PatSignatureIV>
<PatSignature dsi="user 2009/06/02 10:43">313031320D0A3</PatSignature>
<Relationship dsi="user 2009/06/02 10:43">Mother</Relationship>
<currentTime3 dsi="user 2009/06/02 10:43">10:36</currentTime3>
</consent_to_treat>

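It mimics the XML I have to process, but in the real data some elements contain multi-line text, which makes this adventure more interesting...

I built a regular expression that works as long as no element name is repeated: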

$text =~ s/(<(\w+) +[" \w\/\-=:]+?>)(?!\n)(.+?)(?<!\n)(<\/\2>)/$1<!\[CDATA\[$3\]\]>$4/gs;

But it fails on this sample, producing the following:

<consent_to_treat dsi="user 2009/06/02 10:43" version="">
<currentTime4 dsi="user 2009/06/02 10:43"><![CDATA[10:36]]></currentTime4>
<todayDate dsi="user 2009/06/02 10:43"><![CDATA[06/02/2009]]></todayDate>
<todayDate3 dsi="user 2009/06/02 10:43"><![CDATA[06/02/2009]]></todayDate3>
<todayDate4 dsi="user 2009/06/02 10:43"><![CDATA[06/02/2009]]></todayDate4>
<currentTime dsi="user 2009/06/02 10:43"><![CDATA[10:36]]></currentTime>
<Relationship dsi="user 2009/06/02 10:43"><![CDATA[</Relationship>
<PatSignatureIII dsi="user 2009/06/02 10:43"></PatSignatureIII>
<PatSignatureIV dsi="user 2009/06/02 10:43"></PatSignatureIV>
<PatSignature dsi="user 2009/06/02 10:43">313031320D0A3</PatSignature>
<Relationship dsi="user 2009/06/02 10:43">Mother]]></Relationship>
<currentTime3 dsi="user 2009/06/02 10:43"><![CDATA[10:36]]></currentTime3>
</consent_to_treat>
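
The failure above can be reproduced in isolation. The following is a minimal sketch that uses only the two Relationship lines from the sample and the pattern exactly as posted. It suggests that laziness is not the issue: `.+?` must consume at least one character, so on an empty element it cannot stop at the adjacent closing tag, and with /s it keeps going across lines until the next closing tag with the same name. The variable names are illustrative only.

```perl
use strict;
use warnings;

# An empty element followed by a non-empty element with the same name.
my $text = <<'END_XML';
<Relationship dsi="user 2009/06/02 10:43"></Relationship>
<Relationship dsi="user 2009/06/02 10:43">Mother</Relationship>
END_XML

# The first pattern from the question, unchanged.
my $pattern = qr/(<(\w+) +[" \w\/\-=:]+?>)(?!\n)(.+?)(?<!\n)(<\/\2>)/s;

# Group 3 is the text that would be placed inside the CDATA section.
my $captured = $text =~ $pattern ? $3 : undef;
print "captured: [$captured]\n" if defined $captured;
```

The captured text starts with the first `</Relationship>` and ends with `Mother`, which is the span that ends up inside the single CDATA section in the output shown above.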

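What is the best way to make it non-greedy, or is there perhaps a better solution than mine?

Thanks in advance.

P.S. I believe I finally figured it out. The following code seems to do the trick: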

$text =~ s/(<(\w+) +[" \w\/\-=:]+?>)(?!(\n|\s*<\/\2>))(.+?)(?<!\n)(<\/\2>)/$1<!\[CDATA\[$4\]\]>$5/gs;

Thanks again to everyone who answered my question. I am still open to better solutions...
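
For readers who find the one-line version hard to follow, here is the same P.S. pattern spread out with the /x modifier and wrapped in a helper. This is only a sketch: the sub name `wrap_nonempty_in_cdata` is hypothetical, and the literal space before `+` is written as `[ ]` because /x would otherwise ignore it. The capture group inside the negative lookahead is what shifts the content and closing tag to groups 4 and 5.

```perl
use strict;
use warnings;

# Hypothetical helper name; the pattern is the one from the P.S., reformatted with /x.
sub wrap_nonempty_in_cdata {
    my ($text) = @_;
    $text =~ s{
        ( < (\w+) [ ]+ [" \w/\-=:]+? > )  # group 1: opening tag, group 2: element name
        (?! ( \n | \s* </\2> ) )          # group 3, never set: reject empty content or a leading newline
        ( .+? )                           # group 4: content, as short as possible
        (?<! \n )                         # content must not end with a newline
        ( </\2> )                         # group 5: closing tag with the same name
    }{$1<![CDATA[$4]]>$5}gsx;
    return $text;
}

my $sample = <<'END_XML';
<Relationship dsi="user 2009/06/02 10:43"></Relationship>
<Relationship dsi="user 2009/06/02 10:43">Mother</Relationship>
END_XML

print wrap_nonempty_in_cdata($sample);
```

Note that the attribute character class only admits the characters that occur in this sample, and the required space means elements without attributes are never matched, so real data may need a looser pattern.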

1 Answer:

Answer 0: (score: 0)

This regex will do what you need:

s/(<[^>]+>)(.*?)(<\/[^>]+>)/$1<![CDATA[$2]]>$3/gi

Code:

#!/usr/bin/perl
use strict;
use warnings;

my $xml = <<'END_XML';
<currentTime4 dsi="user 2009/06/02 10:43">10:36</currentTime4>
<todayDate dsi="user 2009/06/02 10:43">06/02/2009</todayDate>
<todayDate3 dsi="user 2009/06/02 10:43">06/02/2009</todayDate3>
<todayDate4 dsi="user 2009/06/02 10:43">06/02/2009</todayDate4>
<currentTime dsi="user 2009/06/02 10:43">10:36</currentTime>
<Relationship dsi="user 2009/06/02 10:43"></Relationship>
<PatSignatureIII dsi="user 2009/06/02 10:43"></PatSignatureIII>
<PatSignatureIV dsi="user 2009/06/02 10:43"></PatSignatureIV>
<PatSignature dsi="user 2009/06/02 10:43">313031320D0A3</PatSignature>
<Relationship dsi="user 2009/06/02 10:43">Mother</Relationship>
<currentTime3 dsi="user 2009/06/02 10:43">10:36</currentTime3>
</consent_to_treat>
END_XML

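# Wrap whatever sits between a tag and the next closing tag in CDATA.
# Without /s, '.' does not match newlines, so content must be on one line.
$xml =~ s/(<[^>]+>)(.*?)(<\/[^>]+>)/$1<![CDATA[$2]]>$3/gi;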

print $xml;

Output:

<currentTime4 dsi="user 2009/06/02 10:43"><![CDATA[10:36]]></currentTime4>
<todayDate dsi="user 2009/06/02 10:43"><![CDATA[06/02/2009]]></todayDate>
<todayDate3 dsi="user 2009/06/02 10:43"><![CDATA[06/02/2009]]></todayDate3>
<todayDate4 dsi="user 2009/06/02 10:43"><![CDATA[06/02/2009]]></todayDate4>
<currentTime dsi="user 2009/06/02 10:43"><![CDATA[10:36]]></currentTime>
<Relationship dsi="user 2009/06/02 10:43"><![CDATA[]]></Relationship>
<PatSignatureIII dsi="user 2009/06/02 10:43"><![CDATA[]]></PatSignatureIII>
<PatSignatureIV dsi="user 2009/06/02 10:43"><![CDATA[]]></PatSignatureIV>
<PatSignature dsi="user 2009/06/02 10:43"><![CDATA[313031320D0A3]]></PatSignature>
<Relationship dsi="user 2009/06/02 10:43"><![CDATA[Mother]]></Relationship>
<currentTime3 dsi="user 2009/06/02 10:43"><![CDATA[10:36]]></currentTime3>
</consent_to_treat>
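
As the output above shows, this pattern also wraps empty elements, and without /s it cannot span multi-line content, both of which the question wanted handled. A possible variant, offered as a sketch rather than a definitive fix, ties the closing tag to the opening name with a backreference and restricts the content to text with no nested tags. The sub name `wrap_leaf_text_in_cdata` and the `<Notes>` element are hypothetical, and skipping whitespace-only content is an assumption about what "non-empty" should mean.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the answer above.
sub wrap_leaf_text_in_cdata {
    my ($xml) = @_;
    $xml =~ s{
        ( < (\w+) \b [^>]* > )  # group 1: opening tag, group 2: element name
        (?! \s* < )             # skip empty or whitespace-only content
        ( [^<]+ )               # group 3: text with no nested tags, may span lines
        ( </\2> )               # group 4: closing tag with the same name
    }{$1<![CDATA[$3]]>$4}gx;
    return $xml;
}

my $doc = <<'END_XML';
<consent_to_treat dsi="user 2009/06/02 10:43" version="">
<Relationship dsi="user 2009/06/02 10:43"></Relationship>
<PatSignature dsi="user 2009/06/02 10:43">313031320D0A3</PatSignature>
<Relationship dsi="user 2009/06/02 10:43">Mother</Relationship>
<Notes dsi="user 2009/06/02 10:43">first line
second line</Notes>
</consent_to_treat>
END_XML

print wrap_leaf_text_in_cdata($doc);
```

Because `[^<]+` can never cross a tag, an empty element cannot run on to a later closing tag of the same name. The sketch still treats XML as plain text: content that already contains `]]>` or entity references such as `&amp;` would change meaning once wrapped, so an XML parser is likely the safer choice for anything beyond data shaped like this sample.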