Perl regex problem with parentheses, where the content is multi-line

Asked: 2013-12-25 02:10:15

Tags: regex perl multiline parentheses

I have a string in a file, read by Perl, that can be either:

previous content ending with a linebreak
keyword: content
next content

previous content, also ending with a line end
keyword: { content that contains {
nested parenthesis } and may span
multiple lines, closed by matching parenthesis}
next content

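In either case, I have successfully loaded the content, from the start of the previous content to the end of the next content, into a single string; call it $str.

Now I want to extract the content between the linebreak that ends the previous content and the linebreak that precedes the next content.

So I used a regex like this on $str: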

if($str =~
        /.*\nkeyword: # keyword: is always constant, immediately after a newline
        (?!\{+)       # NO { follows
        \s+(?!\{+)    # NO { with a heading whitespace
        \s*           # white space between keyword: and content
        (?!\{+)       # no { immediately before content 
                      # question : should the last one be a negative lookbehind AFTER the check for content itself?
        ([^\s]+)      # the content, should be in $1;
        (?!\{+)       # no trailing { immediately after content
        \s+           # delimited by a whitespace, ignore what comes afterwards
        |             # or
        /.*\nkeyword: # keyword: is always constant, immediately after a newline
        (?=\s*{*\s*)*) # any mix of whitespace and {
        (?=\{+)       # at least one {
        (?=\s*{*\s*)*) # again any mix of whitespace and {
        ([^\{\}]+)    # no { or }
        (?=\s*}*\s*)*) # any mix of whitespace and }
        (?=\}+)       # at least one }
        (?=\s*}*\s*)*) # again any mix of whitespace and }
) { #do something with $1}

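I realize this does not really handle multi-line content with nested braces; however, it should at least capture objects of the form

keyword: {{ content} }

However, while I am able to capture the content in $1 for the form

keyword: content

I am unable to capture it for the keyword: {{ content} } form.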

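I finally implemented it with a simple counter-based parser instead of a regex. I would still like to know how to do this with a regex, so that objects of the second form are captured, along with an explanation of the regex constructs used.

Also, what is wrong with my formulation, given that it does not even capture single-line content with multiple (but matching) leading and trailing braces?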

2 answers:

Answer 0: (score: 1)

You can use:

#!/usr/bin/perl
use strict;
use warnings;

my $str = "previous content ending with a linebreak
keyword: content
next content

previous content, also ending with a line end
keyword: { content that contains {
nested parenthesis } and may span
multiple lines, closed by matching parenthesis}
next content";

while ($str =~ /\nkeyword:
            (?| # branch reset: i.e. the two capture groups have the same number
                \s*
                (\{ (?> [^{}]++ | (?1) )*+ \}) # recursive pattern; literal braces escaped
              |               # OR
                \h*
                (.*+)   # capture all until the end of line
            )   # close the branch reset group
             /xg ) {

    print "$1\n";
}

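This pattern first tries to match content enclosed in possibly nested curly braces. If no brace is found, or if the braces are unbalanced, it falls back to the second alternative and matches only the rest of the line (because the dot cannot match a newline).

The branch reset feature (?|..|..) gives the capture groups in each alternative the same number.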
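To illustrate the branch reset group in isolation, here is a minimal sketch. The inputs and the `key` prefix are made up for the example and have nothing to do with the question's data; the point is only that both alternatives store their capture in `$1`.

```perl
use strict;
use warnings;

# Hypothetical inputs, one for each alternative of the branch reset group.
my @inputs = ('key=alpha', 'key: beta');
my @values;

for my $input (@inputs) {
    # Inside (?| ... ) the group in each alternative is numbered 1,
    # so there is no need to check $2 for the second form.
    if ($input =~ /^key (?| = (\w+) | : \s* (\w+) )/x) {
        push @values, $1;
    }
}

print "$_\n" for @values;
```

Without the branch reset, the second alternative's capture would end up in `$2`, and the caller would have to test both variables.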

Details of the recursive pattern:

(                 # open the capturing group 1
    \{            # literal opening curly bracket
    (?>           # atomic group: possible content between brackets
        [^{}]++   # all that is not a curly bracket
      |           # OR
        (?1)      # recurse to the capturing group 1 (!here is the recursion!)
    )*+           # repeat the atomic group zero or more times
    \}            # literal closing curly bracket
)                 # close the capturing group 1

In this subpattern I use an atomic group (?>...) and the possessive quantifiers ++ and *+ to avoid backtracking as much as possible.
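To see how the recursive subpattern behaves on its own, the sketch below wraps it in a small helper. The name `extract_braces` and the sample strings are assumptions made for illustration. Note that when the outer brace is never closed, an unanchored search moves on and finds the inner balanced block instead.

```perl
use strict;
use warnings;

# Return the first balanced {...} block found in $text, or undef if none.
# extract_braces is a hypothetical helper name, not part of the answer above.
sub extract_braces {
    my ($text) = @_;
    return $text =~ / ( \{ (?> [^{}]++ | (?1) )*+ \} ) /x ? $1 : undef;
}

# Balanced, nested braces: the whole outer block is captured.
my $nested = extract_braces('keyword: {a {b} c} tail');

# The outer brace is never closed, so the attempt starting there fails
# and the search continues until it reaches the inner block.
my $unbalanced = extract_braces('keyword: {a {b} c tail');

# No braces at all: no match.
my $none = extract_braces('keyword: plain');

print defined $_ ? "$_\n" : "(no match)\n" for $nested, $unbalanced, $none;
```

In the answer's full pattern this fallback to an inner block cannot happen, because the brace alternative is tied to the position right after `keyword:`; an unbalanced block there makes the pattern switch to the rest-of-line alternative.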

Answer 1: (score: 0)

How about something like this?

if ($str =~ /keyword:\s*\{(.*)\}/s) {
    my $key = $1;
    # + rather than *, so the match cannot be empty when $key starts with a brace
    if ($key =~ /([^{}]+)/) {
        print "$1\n";
    }
    else {
        print "$key\n";
    }
}
elsif ($str =~ /keyword:\s*(.*)/) {
    print "$1\n";
}

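The character class [^{}] looks for a run of characters that contains no braces. The match is the first such run inside the outer braces, which is the innermost text when the opening braces are adjacent, as in keyword: {{ content} }.

The s modifier lets .* match across multiple lines. However, you do not want to look across multiple lines for a keyword without braces, so that part goes in the elsif branch.

Do you need the same number of matching braces in the output? For example, should keyword: {foo{bar{hello}}} output {{{hello}}}? If so, I think you would be better off sticking with the counter.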
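For comparison, a counter-based approach might look like the sketch below. This is not the asker's parser, which is not shown in the question; the function name `extract_value` and the shortened sample strings are assumptions. It returns either the balanced block after keyword: or the rest of the line.

```perl
use strict;
use warnings;

# Return what follows "keyword:": a balanced {...} block (which may span
# lines) or, failing that, the rest of the line. Returns undef if the keyword
# is missing or the braces never balance. extract_value is a hypothetical name.
sub extract_value {
    my ($str) = @_;
    return unless $str =~ /^keyword:[ \t]*/m;
    my $start = $+[0];    # offset just past the match

    # Plain form: no opening brace, so take the rest of the line.
    if (substr($str, $start, 1) ne '{') {
        my ($line) = substr($str, $start) =~ /^([^\n]*)/;
        return $line;
    }

    # Brace form: track nesting depth until the first brace is closed.
    my $depth = 0;
    for my $i ($start .. length($str) - 1) {
        my $ch = substr($str, $i, 1);
        $depth++ if $ch eq '{';
        $depth-- if $ch eq '}';
        return substr($str, $start, $i - $start + 1) if $depth == 0;
    }
    return;    # unbalanced
}

my $plain  = extract_value("previous\nkeyword: content\nnext\n");
my $nested = extract_value("previous\nkeyword: { a {\nb } c}\nnext\n");
print "$plain\n$nested\n";
```

The counter makes the unbalanced case explicit (the function returns undef), which the greedy .* in the regex above cannot detect.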

Edit:

For the input

keyword: {multiline 
with nested {parenthesis} }

if you want the output

{multiline with nested {parenthesis} }

I believe that would be

if ($str =~ /keyword:\s*(\{.*\})/s) {
    my $match = $1;
    $match =~ s/\n//g;    # join the lines by dropping the newlines
    print "$match\n";
}
elsif ($str =~ /keyword:\s*(.*)/) {
    print "$1\n";
}