Perl regex: start the match search after (skipping) N chars

Time: 2017-03-24 18:02:51

Tags: regex perl offset skip

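This may be a basic question, but I couldn't find an answer: I have a pattern-matching regex that looks for a 'G', 187 characters, and another 'G' in a big buffer. This works fine with $s =~ m/(G.{187}G)/s. Sometimes I want to add an offset of N bytes to the search (I don't want to start from position 0 of the buffer). Right now I can do $s =~ m/.{N}(G.{187}G)/s, but that doesn't sound very efficient to me, since I don't want to parse the whole beginning of the buffer (it can be big). I tried another approach as well, but couldn't get it set up correctly.

Thanks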

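3 answers:

Answer 0: (score: 1)

As far as I understand, there is no difference in efficiency. From my understanding of finite automata (which regular expressions use), every character has to be looped over anyway. (Otherwise, how would you know where your match starts?) Any regex (assuming it has already been compiled) will execute in linear time.

If you want to skip a specific number of characters, it might not be a bad idea to first take the substring starting at the specified position and then apply the regex to that substring.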
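A minimal sketch of the substring idea, assuming a small made-up buffer and offset (the names $offset, $tail and $found_at are illustrative only). It shows how to map the match position back to the original buffer. Note that a plain substr call copies the tail, which may matter for a very large buffer.

```perl
use strict;
use warnings;

# Hypothetical buffer: 50 filler characters, then 'G' + 187 chars + 'G'
my $s = ('x' x 50) . 'G' . ('y' x 187) . 'G';
my $offset = 20;    # assumed number of characters to skip

# Two-argument substr: everything from $offset to the end (this makes a copy)
my $tail = substr($s, $offset);

my $found_at;
if ($tail =~ m/(G.{187}G)/s) {
    # $-[1] is where group 1 starts within $tail; add $offset back
    # to get the position in the original buffer
    $found_at = $offset + $-[1];
    print "Match of length ", length($1), " starts at offset $found_at\n";
}
```

Here the match is found at offset 50 of the original buffer, since the 50 filler characters precede it.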

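Answer 1: (score: 1)

Updated with a benchmark: once more than a few hundred characters are skipped, pos becomes faster.

You can set the position of (right after) the "last match" using pos. For example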

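use feature 'say';

$_ = q(abcdef);
my $cnt = 0;

while (/(.)/g)
{
    say "Got $1 at pos ", pos();

    # After the first match, move the search position ahead to offset 4
    if (++$cnt == 1) { pos = 4 }
}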

prints

Got a at pos 1
Got e at pos 5
Got f at pos 6

You can also set it before the match, as needed

$_ = q(abcdef); 

pos = 4; 

while (/(.)/g) { say "Got $1 at pos ", pos() }

Got e at pos 5
Got f at pos 6

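which is what answers your direct question. \G does not affect any of this.

However, I am not sure that this is any better than what you have: since . is such a simple "match", for .{N} the regex engine's optimizer may just index directly to N+1 and search for the following pattern from there.

It turns out that this is sensitive to the length of the sequence being skipped (and critically so for very long ones). With a few hundred characters skipped, the benchmark favors .{N}, by up to 25% in my tests. But from around 1_000 skipped characters the results quickly reverse. Run under v5.16 on a laptop.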
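Applied to the pattern from the question, this could look roughly as follows. It is only a sketch on a short made-up buffer, with the 187 shortened to 2 so the data stays readable; $N and $found_at are illustrative names.

```perl
use strict;
use warnings;

# Hypothetical buffer with one match at offset 0 and another at offset 14
my $s = 'GaaG' . ('b' x 10) . 'GccG';
my $N = 1;    # assumed offset: skip the first character

pos($s) = $N;    # the next /g match on $s starts searching here

my ($found, $found_at);
if ($s =~ m/(G.{2}G)/gs) {    # scalar context, /g honours pos($s)
    ($found, $found_at) = ($1, $-[1]);
    print "Found $found at offset $found_at\n";
}
```

With $N = 0 the same code would report the earlier GaaG at offset 0 instead.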

use warnings;
use strict;
use feature 'say';
use Benchmark qw(cmpthese);

my $N = 10_000; # 100; 
my $rep = 1;
# Insert  at the beginning a phrase that also matches, for a test
my $str = 'G'.'a'x$rep.'G' . 'b'x($N-2-$rep) . 'G'.'X'x$rep.'G';

sub posit {
    my ($str, $N, $rep) = @_;
    pos ($str) = $N;
    my ($res) = $str =~ /(G.{$rep}G)/g;
    return $res;
}   

sub match {
    my ($str, $N, $rep) = @_;
    my ($res) = $str =~ /.{$N}(G.{$rep}G)/;
    return $res;
}   

say "posit: ", posit($str,$N,$rep), " and match: ", match($str,$N,$rep);
say "Benchmark skipping first $N positions\n";

cmpthese(-10, {
    posit => sub { my $res = posit ($str, $N, $rep) },
    match => sub { my $res = match ($str, $N, $rep) },
}); 

Note that the regex must have the /g modifier for pos to take effect. Result

posit: GXG and match: GXG
Benchmark skipping first 10000 positions

          Rate match posit
match 125252/s    --  -70%
posit 414886/s  231%    --

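Interestingly, hard-coding the GXG string changes the results significantly in all cases (and always in favor of the winner). However, this parameter would probably be passed in and interpolated, so I still use $rep. As expected, changing its value has no effect, so I keep it short and simple.

Finally, the question specifies a "big buffer". When I set $N to 100_000, I get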

Quantifier in {,} bigger than 32766 in regex; marked by ... 100000}(G.{1}G)/ at ...

So for very large buffers, pos() has to be used.

Note that pos also has side effects that influence further operations. See the documentation.
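A small sketch of two such side effects, on a made-up string: the position persists on the variable after a successful /g match, and a failed /g match resets it to undef unless the /c modifier is also given.

```perl
use strict;
use warnings;

my $s = 'abcGxG';

pos($s) = 3;
my $ok1 = $s =~ /G.G/g;            # succeeds; pos($s) now sits after the match
my $after_success = pos($s);

my $ok2 = $s =~ /G.G/g;            # continues from there, fails, pos is reset
my $after_failure = pos($s);

pos($s) = 3;
my $ok3 = $s =~ /Z/gc;             # fails too, but /c keeps pos where it was
my $kept = pos($s);

print "after success: $after_success\n";
print "after failure: ", (defined $after_failure ? $after_failure : 'undef'), "\n";
print "after failure with /c: $kept\n";
```

So code that later runs another /g match on the same variable will start from wherever pos was left, which is worth keeping in mind when reusing a buffer.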

Answer 2: (score: 1)

I tested the speed of the two approaches:

use feature qw(say);
use strict;
use warnings;

use Benchmark qw(:all);
use Getopt::Long qw(GetOptions);

GetOptions( "case=i" => \my $case ) or die("Error in command line arguments\n");

my $num_cases = 5;
sub case1 { return (187, 2000) };
sub case2 { return (5, 10000) };
sub case3 { return (5, 20000) };
sub case4 { return (10000, 20000) };
sub case5 { return (5, 40000) };

my @cases = ( $case );
if ( !defined $case ) {
    @cases = 1..$num_cases;
}
for my $case ( @cases ) {
    run_case( $case );
}

sub run_case {
    my ( $case ) = @_;

    my $case_coderef = \&{"case" . $case};
    my ( $M, $N ) = $case_coderef->();
    say "Running case $case: \$M = $M, \$N = $N";

    my $prefix = 'A' x $N;
    my $middle = 'B' x $M;
    my $match_str = 'G' . $middle . 'G';

    my $str = $prefix . $match_str;

    my %methods = map {; "method$_" => \&{"method" . $_} } 1..6;

    for my $meth (keys %methods) {
        my $res = eval {
            $methods{$meth}->($str, $M, $N)
        };
        if ( $@ ) {
            print "$@";
            say "Skipping method '$meth'..";
            delete $methods{$meth};
            next;
        }   
        die "Method '$meth' failed.\n" if $res ne $match_str;
    }

    my %code = map { $_ => eval ('sub { $methods{' . $_ . '}->($str, $M, $N) }') } sort keys %methods; 

    cmpthese(-5, \%code );
}

sub method1 {
    my ( $str, $M, $N ) = @_;
    $str =~ m/.{$N}(G.{$M}G)/;
    return $1;
}

sub method2 {
    my ( $str, $M, $N ) = @_;
    pos( $str ) = $N;
    $str =~ m/\G(G.{$M}G)/;
    return $1;
}

sub method3 {
    my ( $str, $M, $N ) = @_;
    pos( $str ) = $N;
    $str =~ m/(G.{$M}G)/g;
    return $1;
}

sub method4 {
    my ( $str, $M, $N ) = @_;
    $str =~ m/.{$N}(G.{$M}G)/s;
    return $1;
}

sub method5 {
    my ( $str, $M, $N ) = @_;
    pos( $str ) = $N;
    $str =~ m/\G(G.{$M}G)/s;
    return $1;
}

sub method6 {
    my ( $str, $M, $N ) = @_;
    pos( $str ) = $N;
    $str =~ m/(G.{$M}G)/gs;
    return $1;
}

Output

Running case 1: $M = 187, $N = 2000
             Rate method1 method3 method2 method6 method5 method4
method1  696485/s      --    -37%    -39%    -44%    -46%    -57%
method3 1112322/s     60%      --     -3%    -10%    -13%    -32%
method2 1146132/s     65%      3%      --     -7%    -10%    -30%
method6 1234678/s     77%     11%      8%      --     -3%    -24%
method5 1278898/s     84%     15%     12%      4%      --    -21%
method4 1629148/s    134%     46%     42%     32%     27%      --
Running case 2: $M = 5, $N = 10000
             Rate method1 method2 method6 method5 method3 method4
method1  226784/s      --    -72%    -72%    -72%    -72%    -78%
method2  801020/s    253%      --     -0%     -2%     -3%    -23%
method6  802386/s    254%      0%      --     -1%     -3%    -23%
method5  814132/s    259%      2%      1%      --     -1%    -22%
method3  823653/s    263%      3%      3%      1%      --    -21%
method4 1046605/s    361%     31%     30%     29%     27%      --
Running case 3: $M = 5, $N = 20000
             Rate method1 method3 method2 method6 method5 method4
method1  122763/s      --    -90%    -90%    -90%    -90%    -92%
method3 1252858/s    921%      --     -0%     -1%     -2%    -23%
method2 1258330/s    925%      0%      --     -1%     -1%    -22%
method6 1265165/s    931%      1%      1%      --     -1%    -22%
method5 1274309/s    938%      2%      1%      1%      --    -21%
method4 1622943/s   1222%     30%     29%     28%     27%      --
Running case 4: $M = 10000, $N = 20000
             Rate method1 method3 method2 method6 method5 method4
method1   90687/s      --    -62%    -62%    -93%    -93%    -94%
method3  236835/s    161%      --     -1%    -81%    -81%    -86%
method2  238334/s    163%      1%      --    -81%    -81%    -85%
method6 1236025/s   1263%    422%    419%      --     -3%    -25%
method5 1270943/s   1301%    437%    433%      3%      --    -22%
method4 1638548/s   1707%    592%    588%     33%     29%      --
Running case 5: $M = 5, $N = 40000
Quantifier in {,} bigger than 32766 in regex; marked by <-- HERE in m/.{ <-- HERE 40000}(G.{5}G)/ at ./p.pl line 83.
Skipping method 'method4'..
Quantifier in {,} bigger than 32766 in regex; marked by <-- HERE in m/.{ <-- HERE 40000}(G.{5}G)/ at ./p.pl line 63.
Skipping method 'method1'..
             Rate method2 method3 method6 method5
method2 1253528/s      --     -1%     -1%     -2%
method3 1260746/s      1%      --     -0%     -1%
method6 1263378/s      1%      0%      --     -1%
method5 1278718/s      2%      1%      1%      --

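I ran this on my laptop, an Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz running Ubuntu 16.10. The results show that using the s modifier speeds up method1 considerably (compare it with method4). But method1 and method4 cannot be used when $N exceeds 32766.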
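One thing the timings do not show is that the \G variants and the /g variants are not interchangeable in general. In the benchmark the target sits exactly at offset $N, so both succeed. A sketch on a made-up string where the target starts later than $N:

```perl
use strict;
use warnings;

my $s = 'aaaaGxGbb';    # hypothetical buffer, target GxG starts at offset 4
my $N = 2;

# \G anchors the match at pos($s): it only succeeds if the pattern
# begins exactly at offset $N
pos($s) = $N;
my ($anchored) = $s =~ m/\G(G.G)/;

# /g without \G searches forward from pos($s)
pos($s) = $N;
my ($searched) = $s =~ m/(G.G)/g;

print "anchored: ", (defined $anchored ? $anchored : 'no match'), "\n";
print "searched: ", (defined $searched ? $searched : 'no match'), "\n";
```

Which one is appropriate depends on whether N is meant as "the match begins here" or "start looking from here"; the question reads like the latter.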