Question

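I need to match key = value pairs in arbitrary text using the following rules.

The leading line has this structure:
- it starts with indentation: "two spaces or a tab", at least once, e.g. (  |\t)+
- then the + character and one space
- then the word VAR or CONST
- then the key, made of word characters (\w+)
- then = and the value

Examples: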

  + VAR somename = somevalue (indented with two spaces)
        + VAR name3 = indented by one \t

The following regex matches these lines:

/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*(.*)$/
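
As a quick check of the single-line pattern above, here is a minimal sketch that runs it against a few sample lines. The helper name parse_line and the third (one-space) sample line are assumptions made for illustration.

```perl
use strict;
use warnings;

# The single-line pattern from above; the indent is two literal spaces or a tab.
my $re = qr/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*(.*)$/;

# Hypothetical helper: returns [type, key, value] for a matching line,
# or an empty list (undef in scalar context) otherwise.
sub parse_line {
    my ($line) = @_;
    return $line =~ $re ? [ $2, $3, $4 ] : ();
}

for my $line ( "  + VAR somename = somevalue",
               "\t+ CONST name3 = by tab",
               " + VAR bad = only one space" ) {
    my $m = parse_line($line);
    print $m ? "match: @$m\n" : "no match\n";
}
```

The first two lines match; the third does not, because a single space is not a valid indent.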

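Now the problem: the syntax allows continuation lines. When a line like the ones above is followed by a line that starts with at least one indent sequence (  |\t) (i.e. two spaces or one tab), that line is considered a continuation line, and its whole content (including the leading whitespace) belongs to the value of the key from the previous line.

Example:

  + VAR multi = 3 line value where the continuation lines
  are indented (starts with two spaces or one tab)
  and NOT followed by the '+'

For instance, the regex for a continuation line is

/^(  |\t)+([^\+](.*))$/

With a line-based approach the solution is easy, e.g. when I split the whole text into lines and process them one by one.

However, I'm looking for a single (complex) regex, mainly for learning and benchmarking, that matches the key = value pairs in both their single-line and multi-line forms. I tried this:

while( $text =~ m/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*((.*)$(?=(  |\t)+[^\+](.*)$)*)/gm ) {
    ...
}

but I got:

(?=(  |\t)+[^\+](.*)$)* matches null string many times in regex; marked by <-- HERE in m/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*((.*)$(?=(  |\t)+[^\+](.*)$)* <-- HERE )/ at so line 36.

Side question: how can I use a multi-line extended regex such as

/
    ^(  |\t)+     # <- space ... :(
    \+\s+
    (VAR|CONST)
    \s+
    (\w+)
    \s*=\s*
    (.*)$
/x

when the regex must contain literal space characters (i.e. the generic \s can't be used)?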

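EDIT: I used the accepted answer and added the wanted capture groups.

If anyone wants to help, here is the original script: it generates the wanted output (using the line-based approach) and also contains the non-working regex-based solution.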

#!/usr/bin/env perl
use 5.014;
use warnings;
use Data::Dumper;

my $txt = do { local $/; <DATA> };

my @matches1 = parse_by_lines($txt // '');
mydump('BY LINES', @matches1);

my @matches2 = parse_by_one_regex($txt // '');
mydump('REGEX', @matches2);

sub parse_by_lines { #produces the wanted output
    my ($text) = @_;
    my @match;
    my $havekey;
    for my $line (split "\n", $text) {
        if( $line =~ m/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*(.*)$/ ) {
            push @match, { indent => $1, type => $2, key => $3, val => $4 };
            $havekey++;
        }
        elsif( $havekey && $line =~ m/^(  |\t)+([^\+](.*))$/ ) {    #continuation line
            $match[-1]->{val} .= "\n$line"; # preserve the \n in the val
        }
        else {
            $havekey = 0;
        }
    }
    return @match;
}


sub parse_by_one_regex { #not working
    my ($text) = @_;
    my @match;
    while( $text =~ m/^(  |\t)+\+\s+(VAR|CONST)\s+(\w+)\s*=\s*((.*)$(?=(  |\t)+[^\+](.*)$)*)/gm ) {
        push @match, { indent => $1, type => $2, key => $3, val => $4 };
    }
    return @match;
}

sub mydump {
    my($label, @match) = @_;
    say "#### $label ####";
    for my $m ( @match ) {
        printf "%-6s: [%s]\n", $_, $m->{$_} for (qw(indent type key val));
        print "\n";
    }
}

__DATA__
some arbitrary text lines
or empty lines

    could be indented
  and could contain any character

  + VAR name1 = var indented by two spaces and the first nonspace character is '+'
line of arbitrary text
    + VAR name2 = var indented by 2x2 spaces

    + VAR name3 = var indented by one \t
  + VAR name4 = the next line with "name5" is not valid. missing the = character, should not be matched
  + VAR name5
  + CONST name6 = the type could be VAR or CONST

  + VAR multi1 = multiline value where the continuation lines
  are indented (starts with two spaces or one tab) and NOT followed by the '+'

  + VAR multi1 = multiline value
    indented

  + VAR multi1 = multiline value
     indented ok too


  + VAR single = this is single line
  + because this line even if it is indented, the first nonspace character is '+'

  + VAR multi2 = multiline
  could be
     indented
        any way
  and any number of times
  until the first non-indented line

the following should NOT match

+ VAR some = sould not be matched, because the line isn't indented
 + VAR some = sould not be matched, because the line isn't indented at least with TWO spaces or one tab
  + SOME name = value not matched because the SOME isn't VAR or CONST

EDIT2: Yes, the regex-based solution is 34% faster (at least on my hardware).
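
A minimal sketch of how capture groups might be added to the accepted answer's pattern so that it yields the same four fields as parse_by_lines. The group layout, the extra " *" before the value, and the sub name parse_with_groups are assumptions for illustration, not the asker's actual code.

```perl
use strict;
use warnings;

# Accepted-answer pattern with four capture groups added: indent, type, key, value.
# Assumption: the " *" after "=" is added so the value has no leading spaces.
my $re = qr/^((?:  +|\t+))\+ *(VAR|CONST) *(\w+) *= *(.*(?:\R^(?>  +|\t+)[^+\s].*)*)/m;

# Hypothetical stand-in for parse_by_one_regex: one hash per matched key.
sub parse_with_groups {
    my ($text) = @_;
    my @match;
    while ( $text =~ /$re/g ) {
        push @match, { indent => $1, type => $2, key => $3, val => $4 };
    }
    return @match;
}

my $sample = "text\n  + VAR multi = first\n  second\n     third\n  + CONST k = v\nend\n";
for my $m ( parse_with_groups($sample) ) {
    ( my $val = $m->{val} ) =~ s/\n/\\n/g;    # make embedded newlines visible
    print "$m->{type} $m->{key} = [$val]\n";
}
```

Unlike the question's (  |\t)+ indent, this pattern follows the answer in not accepting mixed tab-and-space indentation, so results may differ from parse_by_lines on such input.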

Answer 1

Regex:

(?m)^(?:  +|\t+)\+ *(?:VAR|CONST) *\w+ *=.*(?:\R^(?>  +|\t+)[^+\s].*)*

Live demo

The important part is the last cluster:

(?:                 # Start of non-capturing group (a)
    \R              # One line break
    ^               # Start of line
    (?>  +|\t+)     # At least two spaces or at least one tab (atomic group, no backtracking)
    [^+\s]          # Next character is neither `+` nor whitespace (so not a line break either)
    .*              # Up to the end of the line
)*                  # Repeat as many times as possible - end of non-capturing group (a)

To answer your second question:

When the x modifier is set, literal space characters in the pattern are ignored rather than treated as a meaningful part of the regular expression, unless you enclose them in a character class [ ] and use a quantifier to say how many times they should occur:

[ ]{2,}

Live demo
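
The point about the x modifier can be sketched as follows: under /x a literal space is written as [ ], so "two or more spaces" becomes [ ]{2,}. The helper name is_indented_plus and the sample lines are assumptions for illustration.

```perl
use strict;
use warnings;

# Under /x, bare whitespace in the pattern is ignored, so literal spaces
# are written inside a character class.
my $indent = qr/
    ^ (?: [ ]{2,} | \t+ )   # at least two spaces, or at least one tab
    \+                      # then a literal plus sign
/x;

# Hypothetical helper: 1 if the line starts with a valid indent followed by '+', else 0.
sub is_indented_plus {
    my ($line) = @_;
    return $line =~ $indent ? 1 : 0;
}

print is_indented_plus($_), "\n" for ( "  + a", "\t+ a", " + a", "+ a" );
```

An escaped space (a backslash followed by a space) is also kept under /x, but the character class is easier to spot when reading the pattern.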

Regex for matching indented continuation lines
