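Enclose strings that are not yet enclosed with paired delimiters

Date: 2014-01-19 22:34:08

Tags: regex perl

I need to enclose, in a pair of delimiters, the strings that are not yet enclosed. Example text: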

Some text or random characters here. {% Another random string
enclosed in a pair of delimiters as next {% what can be deeply
nested {% as {%here%}%} end of delimited %} text. %}Another
bla-bla

random text outside of the delimiters - called
as "free text".

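I need to enclose all the free text in

{%ORIG .... original free text ... %}

and leave the already enclosed strings unmodified. So, in the example above, two pieces of free text need to be enclosed, and the result should be: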

{%ORIG Some text or random characters here. %}{% Another random string
enclosed in a pair of delimiters as next {% what can be deeply
nested {% as {%here%}%} end of delimited %} text. %}{%ORIG Another
bla-bla

random text outside of the delimiters - called
as "free text".%}

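So the opening delimiter is {% and the closing delimiter is %}.

Questions:

  • Is it possible to do this with regexes, or do I need to write a parser for it?
  • Is there a CPAN module I can use for this task?

2 answers: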

Answer 0: (score: 6)

You can do this with a regex, with the help of recursive subpattern calls like (?R).

For example:

$_ = <<'_STR_';
Some text or random characters here. {% Another random string
enclosed in a pair of delimiters as next {% what can be deeply
nested {% as {%here%}%} end of delimited %} text. %}Another
bla-bla

random text outside of the delimiters - called
as "free text".
_STR_

s/
  ( \{% (?R)* %\} )             # match balanced {% %} groups, recursing into the whole pattern
|
  ( (?: (?! \{% | %\} ) . )+ )  # match a run of anything that is not a delimiter
/
  $1 ? $1 : "{%ORIG $2 %}";  # if {% ... %} matched, leave it as is; otherwise enclose it
/gsex;

print;

Output:

{%ORIG Some text or random characters here.  %}{% Another random string
enclosed in a pair of delimiters as next {% what can be deeply
nested {% as {%here%}%} end of delimited %} text. %}{%ORIG Another
bla-bla

random text outside of the delimiters - called
as "free text".
 %}
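
The output above differs slightly from what the question asks for: the replacement pads $2 with a space on each side, and the newline that ends the heredoc lands inside the last wrapper. Below is a minimal sketch of one way to get the requested form, assuming it is acceptable to chomp the input first. The helper name wrap_free_text is made up for illustration, and the recursion uses (?1) (re-enter group 1) rather than (?R), so only the balanced-group part of the pattern is re-entered.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the answer above: wraps each run of free
# text in {%ORIG ...%} and copies balanced {% ... %} groups through unchanged.
sub wrap_free_text {
    my ($text) = @_;
    chomp $text;    # assumption: a single trailing newline is not wanted inside the wrapper
    $text =~ s/
        ( \{% (?: (?1) | (?! \{% | %\} ) . )* %\} )  # group 1: balanced block, nesting via (?1)
      |
        ( (?: (?! \{% | %\} ) . )+ )                 # group 2: run of free text
    /
        defined $1 ? $1 : "{%ORIG $2%}"
    /gsex;
    return $text;
}

print wrap_free_text("a {% b {% c %} %} d\n"), "\n";
# prints: {%ORIG a %}{% b {% c %} %}{%ORIG  d%}
```

A stray, unbalanced {% or %} matches neither alternative, so this sketch leaves it in place rather than reporting an error.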

Answer 1: (score: 5)

Jonathan Leffler's suggestion is right. You can use the Text::Balanced module and its extract_tagged function to solve this:

#!/usr/bin/env perl

use warnings;
use strict;
use Text::Balanced qw<extract_tagged>;

my ($open_delim, $close_delim) = qw( {% %} );

my $text = do { local $/ = undef; <> };
chomp $text;

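# Peel off one balanced {% ... %} block per iteration until none are left.
while (1) {
    # List context: $r[0] = balanced block, $r[1] = remainder, $r[2] = prefix (free text).
    my @r = extract_tagged($text, $open_delim, $close_delim, '(?s).*?(?={%)', undef);
    if (length $r[2]) {
        printf qq|%sORIG %s%s|, $open_delim, $r[2], $close_delim;
    }

    if (length $r[0]) {
        printf qq|%s|, $r[0];
    }
    else {
        # No more blocks: whatever remains is free text.
        if (length $r[1]) {
            printf qq|%sORIG %s%s|, $open_delim, $r[1], $close_delim;
        }
        last;
    }

    $text = $r[1];
}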

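The program loops until there are no more delimiters in the text. On each iteration it checks the prefix (the text up to the opening delimiter, $r[2]) and surrounds it with the delimiters; text that is already enclosed ($r[0]) is printed as is.

At the beginning I slurp the whole content of the file, because this function only works on scalars. You should look at the documentation to see what the function returns, and I hope this gives you an idea of how to solve the problem in case yours is much more complex than this example.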
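
To make the return values concrete, here is a small sketch that calls extract_tagged once on a made-up string and prints the pieces the loop relies on. The sample text and variable names are illustrative only, and the delimiters are written as escaped patterns here on the assumption that a literal brace is safest escaped, since extract_tagged treats its tag arguments as regexes.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Made-up sample: free text, one nested block, more free text.
my $text = 'free {% a {% b %} c %} tail';

# In list context extract_tagged returns six elements; the loop in the
# answer uses the first three.
my ($extracted, $remainder, $prefix, $open, $inner, $close) =
    extract_tagged($text, '\{%', '%\}', '(?s).*?(?=\{%)');

print "prefix:    [$prefix]\n";     # free text before the first opening delimiter
print "extracted: [$extracted]\n";  # the balanced block, outer delimiters included
print "inner:     [$inner]\n";      # text between the outer delimiters
print "remainder: [$remainder]\n";  # what is left for the next iteration
```

When no opening delimiter is left, the prefix pattern fails and the extracted and prefix elements come back undefined, which is why the loop falls through to wrapping the remainder.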

Just to test, run it like:

perl script.pl infile

That yields:

{%ORIG Some text or random characters here. %}{% Another random string
enclosed in a pair of delimiters as next {% what can be deeply
nested {% as {%here%}%} end of delimited %} text. %}{%ORIG Another
bla-bla

random text outside of the delimiters - called
as "free text".%}