Perl regex multiple matches

Time: 2013-03-07 18:44:35

Tags: regex perl backreference

I'm looking for a regular expression that behaves as follows:

Input: "hello world."

Output: he, el, ll, lo, wo, or, rl, ld

My idea was something along the lines of

    while($string =~ m/(([a-zA-Z])([a-zA-Z]))/g) {
        print "$1-$2 ";
    }

But that does something a bit different.
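To make the difference concrete, here is a minimal sketch of what that loop matches. The sub name `non_overlapping_pairs` is made up for illustration, and it collects `$1` instead of printing:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): collect what the
# loop above captures as $1.
sub non_overlapping_pairs {
    my ($string) = @_;
    my @pairs;
    # Each match consumes both letters, so the next search starts
    # after them and overlapping pairs such as "el" are never seen.
    while ($string =~ m/(([a-zA-Z])([a-zA-Z]))/g) {
        push @pairs, $1;
    }
    return @pairs;
}

print join(",", non_overlapping_pairs("hello world.")), "\n";   # he,ll,wo,rl
```

The pairs "el", "lo", "or" and "ld" are missing because their first letter was already consumed by the previous match.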

5 answers:

Answer 0: (score: 10)

This is tricky. You have to capture the pair, save it, and then force the regex engine to backtrack.

You can do that this way:

use v5.10;   # first release with backtracking control verbs

my $string = "hello, world!";
my @saved;

my $pat = qr{
    ( \pL {2} )
    (?{ push @saved, $^N })
    (*FAIL)
}x;

@saved = ();
$string =~ $pat;
my $count = @saved;
printf "Found %d matches: %s.\n", $count, join(", " => @saved);

That produces this:

Found 8 matches: he, el, ll, lo, wo, or, rl, ld.

If you don't have v5.10, or if that gives you a headache, you can use this:

my $string = "hello, world!";
my @pairs = $string =~ m{
  # we can only match at positions where the
  # following sneak-ahead assertion is true:
    (?=                 # zero-width look ahead
        (               # begin stealth capture
            \pL {2}     #       save off two letters
        )               # end stealth capture
    )
  # succeed after matching nothing, force reset
}xg;

my $count = @pairs;
printf "Found %d matches: %s.\n", $count, join(", " => @pairs);

That produces the same output as before.

But you might still have a headache.

Answer 1: (score: 5)

No need to "force backtracking"!

push @pairs, "$1$2" while /([a-zA-Z])(?=([a-zA-Z]))/g;

Though you probably want to match any letter rather than the limited set you specified.

push @pairs, "$1$2" while /(\pL)(?=(\pL))/g;
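As a complete script, the one-liner might be wrapped up as in the following sketch. The sub name `letter_pairs` is an assumption for illustration, and it matches against its argument rather than `$_`:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the one-liner above.
sub letter_pairs {
    my ($string) = @_;
    my @pairs;
    # Consume one letter, then peek at (and capture) the next letter
    # inside the lookahead without consuming it.
    push @pairs, "$1$2" while $string =~ /(\pL)(?=(\pL))/g;
    return @pairs;
}

print join(", ", letter_pairs("hello world.")), "\n";
# he, el, ll, lo, wo, or, rl, ld
```

Because only the first letter of each pair is consumed, the next match can start at the second letter, which is what allows the pairs to overlap.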

Answer 2: (score: 1)

Another approach. It doesn't use any regex magic. It does use nested maps, but these could easily be converted to for loops if desired.

#!/usr/bin/env perl

use strict;
use warnings;

my $in = "hello world.";
my @words = $in =~ /(\b\pL+\b)/g;

my @out = map {
  my @chars = split '';
  map { $chars[$_] . $chars[$_+1] } ( 0 .. $#chars - 1 );
} @words;

print join ',', @out;
print "\n";

Again, to me this is more readable than a strange regex, but YMMV.
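For readers who prefer explicit loops, the nested map could be rewritten roughly as follows. This is only a sketch, and the sub name `pairs_by_loop` is invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: same idea as the nested map, written with for loops.
sub pairs_by_loop {
    my ($in) = @_;
    my @out;
    # In list context, m//g returns every captured word.
    for my $word ($in =~ /(\b\pL+\b)/g) {
        my @chars = split //, $word;
        # Stop one short of the last index so $i + 1 stays in range.
        for my $i (0 .. $#chars - 1) {
            push @out, $chars[$i] . $chars[$i + 1];
        }
    }
    return @out;
}

print join(',', pairs_by_loop("hello world.")), "\n";
# he,el,ll,lo,wo,or,rl,ld
```

A one-letter word produces an empty range (`0 .. -1`), so it contributes no pairs.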

Answer 3: (score: 0)

I would use a capturing group inside a lookahead:

(?=([a-zA-Z]{2}))
    ------------
         |->group 1 captures two English letters 

Try it here
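In Perl, that pattern could be applied as in this minimal sketch. The sub name `lookahead_pairs` is an assumption, not part of the answer:

```perl
use strict;
use warnings;

# Hypothetical helper applying the lookahead pattern above.
sub lookahead_pairs {
    my ($string) = @_;
    # The lookahead matches zero characters, so each /g iteration moves
    # forward by one position; in list context the match returns every
    # group 1 capture.
    my @pairs = $string =~ /(?=([a-zA-Z]{2}))/g;
    return @pairs;
}

print join(", ", lookahead_pairs("hello world.")), "\n";
# he, el, ll, lo, wo, or, rl, ld
```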

Answer 4: (score: 0)

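You can do this by looking for a single letter, then using the position where that match ended: the pos function reports it, \G refers to it from inside another regex, and substr reads a couple of characters out of the string at that point.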

use v5.10;
use strict;
use warnings;

my $letter_re = qr/[a-zA-Z]/;

my $string = "hello world.";
while( $string =~ m{ ($letter_re) }gx ) {
    # Skip it if the next character isn't a letter
    # \G will match where the last m//g left off.
    # It's pos() in a regex.
    next unless $string =~ /\G $letter_re /x;

    # pos() is still where the last m//g left off.
    # Use substr to print the character before it (the one we matched)
    # and the next one, which we know to be a letter.
    say substr $string, pos($string)-1, 2;
}

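You can put the "check the next letter" logic into the original regex by using a zero-width positive lookahead assertion, (?=pattern). Zero-width means the text it checks is not included in the match and does not advance the position of the m//g regex. This is a bit more compact, but zero-width assertions can get tricky.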

while( $string =~ m{ ($letter_re) (?=$letter_re) }gx ) {
    # pos() is still where the last m//g left off.
    # Use substr to print the character before it (the one we matched)
    # and the next one, which we know to be a letter.
    say substr $string, pos($string)-1, 2;
}

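Update: I originally tried to capture both the match and the lookahead as m{ ($letter_re (?=$letter_re)) }gx, but that doesn't work. The lookahead is zero-width, so the letter it checks slips out of the capture. The other answers show that if you put a second capture inside the lookahead, it all collapses to just...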

say "$1$2" while $string =~ m{ ($letter_re) (?=($letter_re)) }gx;
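To see why the first attempt mentioned in the update falls short, this sketch collects what its single capture group holds. The sub name `first_attempt_captures` is made up for illustration:

```perl
use strict;
use warnings;

my $letter_re = qr/[a-zA-Z]/;

# Hypothetical helper: run the pattern from the first attempt and
# return whatever ends up in $1 on each iteration.
sub first_attempt_captures {
    my ($string) = @_;
    my @got;
    # The lookahead sits inside the capture group, but it consumes
    # nothing, so $1 only ever holds the single letter before it.
    while ($string =~ m{ ($letter_re (?=$letter_re)) }gx) {
        push @got, $1;
    }
    return @got;
}

print join(",", first_attempt_captures("hello world.")), "\n";
# h,e,l,l,w,o,r,l
```

Each capture is one letter that happens to be followed by another letter, not the pair itself, which is why the second capture inside the lookahead is needed.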

I'm leaving all of these versions here in the spirit of TMTOWTDI, especially for those who aren't regex masters.