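This may be a basic question, but I could not find an answer:
I have a pattern-matching regex that looks for 'G', then 187 characters, then another 'G' in a large buffer. This works with $s =~ m/(G.{187}G)/s. Sometimes I want to add an offset of N bytes to the search (I do not want to start from position 0 of the buffer). I can currently do $s =~ m/.{N}(G.{187}G)/s, but that does not seem efficient to me, because I do not want to parse the whole start of the buffer (it may be large). I tried to find another way, but could not get it right.
Thanks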
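Answer 0 (score: 1)
As far as I understand, there is no difference in efficiency. From my understanding of finite automata (which regular expressions use), every character has to be looped over anyway. (Otherwise, how would you know where your match starts?) Any regex (assuming it has already been compiled) will execute in linear time.
If you want to skip a specific number of characters, it may not be a bad idea to first take the substring that starts at the specified position and then apply the regex to that substring.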
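The substring idea can be sketched as follows. This is only an illustration under assumed values: the buffer contents, $M and $N are made up for the example, and substr used this way returns a copy of the tail, which has its own cost on a large buffer.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical test data: a decoy match at offset 0, the wanted one at offset $N.
my $M   = 187;
my $N   = 1000;
my $buf = 'G' . ('x' x $M) . 'G';        # decoy, 189 characters
$buf   .= 'a' x ($N - length $buf);      # padding up to offset $N
$buf   .= 'G' . ('y' x $M) . 'G';        # the match we want

# Apply the regex only to the part of the buffer that starts at offset $N.
my ($found) = substr($buf, $N) =~ /(G.{$M}G)/s;

say defined $found ? "found, length " . length($found) : "no match";
```

The decoy at the start of the buffer is skipped because the regex never sees it.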
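Answer 1 (score: 1)
Update with benchmarks: when skipping more than a few hundred characters, pos becomes faster.
You can use pos to set the position where the "last match" left off (just after it). For example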
use feature 'say';
my $cnt = 0;
$_ = q(abcdef);
while (/(.)/g)
{
    say "Got $1 at pos ", pos();
    if (++$cnt == 1) { pos = 4 }   # after the first match, jump ahead to offset 4
}
prints
Got a at pos 1
Got e at pos 5
Got f at pos 6
You can also set it before matching, as needed
$_ = q(abcdef);
pos = 4;
while (/(.)/g) { say "Got $1 at pos ", pos() }
which prints
Got e at pos 5
Got f at pos 6
This is what answers your direct question. The \G anchor is not needed for any of this.
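However, I am not sure that this is better than what you have: . is such a simple "match" that, for .{N}, the regex engine's optimizer may be able to index directly to position N+1 and search for the pattern that follows.
It turns out that this is sensitive to the length of the sequence to skip (and critically so for very long ones). With a few hundred characters to skip, the benchmark favors .{N}, by up to 25% in my tests. But from about 1_000 skipped characters on, the results quickly reverse. Run on a laptop with v5.16.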
use warnings;
use strict;
use feature 'say';
use Benchmark qw(cmpthese);
my $N = 10_000; # 100;
my $rep = 1;
# Insert at the beginning a phrase that also matches, for a test
my $str = 'G'.'a'x$rep.'G' . 'b'x($N-2-$rep) . 'G'.'X'x$rep.'G';
sub posit {
my ($str, $N, $rep) = @_;
pos ($str) = $N;
my ($res) = $str =~ /(G.{$rep}G)/g;
return $res;
}
sub match {
my ($str, $N, $rep) = @_;
my ($res) = $str =~ /.{$N}(G.{$rep}G)/;
return $res;
}
say "posit: ", posit($str,$N,$rep), " and match: ", match($str,$N,$rep);
say "Benchmark skipping first $N positions\n";
cmpthese(-10, {
posit => sub { my $res = posit ($str, $N, $rep) },
match => sub { my $res = match ($str, $N, $rep) },
});
Note that the regex must use /g for pos to take effect. Results
posit: GXG and match: GXG
Benchmark skipping first 10000 positions

          Rate match posit
match 125252/s    --  -70%
posit 414886/s  231%    --
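Interestingly, using a hard-coded GXG string instead changes the results significantly in all cases (and always in favor of the winner). However, this parameter would likely be passed in and interpolated, so I still use $rep. As expected, changing it has no effect, so I keep it short and simple.
Finally, the question specifies a "large buffer". When I set $N to 100_000 I get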
Quantifier in {,} bigger than 32766 in regex; marked by ... 100000}(G.{1}G)/ at ...
So for very large buffers one must use pos().
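As a rough sketch of that point, here is pos applied to the pattern from the question with an offset above the quantifier limit reported in the error message. The buffer contents and the variable names $decoy, $match and $offset are invented for the example; only pos, the /g and /s modifiers, and the @- array are standard Perl.

```perl
use strict;
use warnings;
use feature 'say';

my $M = 187;
my $N = 100_000;                 # offset larger than the limit in the error above

# Hypothetical buffer: a decoy match at offset 0, the wanted one at offset $N.
my $decoy = 'G' . ('a' x $M) . 'G';
my $buf   = $decoy . ('b' x ($N - length $decoy)) . 'G' . ('X' x $M) . 'G';

pos($buf) = $N;                  # the next //g match starts searching here

my ($match, $offset);
if ($buf =~ /(G.{$M}G)/gs) {
    # $-[1] holds the offset at which capture group 1 starts
    ($match, $offset) = ($1, $-[1]);
}

say defined $match
    ? "match of length " . length($match) . " at offset $offset"
    : "no match";
```

No quantifier depends on $N here, so the size of the offset is not constrained by the regex compiler.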
Note that pos also has side effects that influence further operations. See the documentation.
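One such side effect, shown as a small sketch on a made-up string: a failed //g match resets the position, unless the /c modifier is added. The variable names are only for illustration.

```perl
use strict;
use warnings;
use feature 'say';

my $s = 'abcdef';

pos($s) = 4;
my $ok = $s =~ /e/g;                 # succeeds, pos moves just past the match
my $after_success = pos($s);

$ok = $s =~ /zzz/g;                  # fails, which resets pos
my $after_failure = pos($s);

pos($s) = 4;
$ok = $s =~ /zzz/gc;                 # fails, but /c keeps pos where it was
my $after_failure_c = pos($s);

say "after success:        $after_success";
say "after failure:        ", $after_failure // 'undef';
say "after failure with c: $after_failure_c";
```

The position also belongs to the particular scalar, so setting pos on a copy of a string (for example inside a sub that received it by value) leaves the original untouched.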
Answer 2 (score: 1)
I tested the speed of the two approaches:
use feature qw(say);
use strict;
use warnings;
use Benchmark qw(:all);
use Getopt::Long qw(GetOptions);
GetOptions( "case=i" => \my $case ) or die("Error in command line arguments\n");
my $num_cases = 5;
sub case1 { return (187, 2000) };
sub case2 { return (5, 10000) };
sub case3 { return (5, 20000) };
sub case4 { return (10000, 20000) };
sub case5 { return (5, 40000) };
my @cases = ( $case );
if ( !defined $case ) {
@cases = 1..$num_cases;
}
for my $case ( @cases ) {
run_case( $case );
}
sub run_case {
my ( $case ) = @_;
my $case_coderef = \&{"case" . $case};
my ( $M, $N ) = $case_coderef->();
say "Running case $case: \$M = $M, \$N = $N";
my $prefix = 'A' x $N;
my $middle = 'B' x $M;
my $match_str = 'G' . $middle . 'G';
my $str = $prefix . $match_str;
my %methods = map {; "method$_" => \&{"method" . $_} } 1..6;
for my $meth (keys %methods) {
my $res = eval {
$methods{$meth}->($str, $M, $N)
};
if ( $@ ) {
print "$@";
say "Skipping method '$meth'..";
delete $methods{$meth};
next;
}
die "Method '$meth' failed.\n" if $res ne $match_str;
}
my %code = map { $_ => eval ('sub { $methods{' . $_ . '}->($str, $M, $N) }') } sort keys %methods;
cmpthese(-5, \%code );
}
sub method1 {
my ( $str, $M, $N ) = @_;
$str =~ m/.{$N}(G.{$M}G)/;
return $1;
}
sub method2 {
my ( $str, $M, $N ) = @_;
pos( $str ) = $N;
$str =~ m/\G(G.{$M}G)/;
return $1;
}
sub method3 {
my ( $str, $M, $N ) = @_;
pos( $str ) = $N;
$str =~ m/(G.{$M}G)/g;
return $1;
}
sub method4 {
my ( $str, $M, $N ) = @_;
$str =~ m/.{$N}(G.{$M}G)/s;
return $1;
}
sub method5 {
my ( $str, $M, $N ) = @_;
pos( $str ) = $N;
$str =~ m/\G(G.{$M}G)/s;
return $1;
}
sub method6 {
my ( $str, $M, $N ) = @_;
pos( $str ) = $N;
$str =~ m/(G.{$M}G)/gs;
return $1;
}
Output:
Running case 1: $M = 187, $N = 2000
Rate method1 method3 method2 method6 method5 method4
method1 696485/s -- -37% -39% -44% -46% -57%
method3 1112322/s 60% -- -3% -10% -13% -32%
method2 1146132/s 65% 3% -- -7% -10% -30%
method6 1234678/s 77% 11% 8% -- -3% -24%
method5 1278898/s 84% 15% 12% 4% -- -21%
method4 1629148/s 134% 46% 42% 32% 27% --
Running case 2: $M = 5, $N = 10000
Rate method1 method2 method6 method5 method3 method4
method1 226784/s -- -72% -72% -72% -72% -78%
method2 801020/s 253% -- -0% -2% -3% -23%
method6 802386/s 254% 0% -- -1% -3% -23%
method5 814132/s 259% 2% 1% -- -1% -22%
method3 823653/s 263% 3% 3% 1% -- -21%
method4 1046605/s 361% 31% 30% 29% 27% --
Running case 3: $M = 5, $N = 20000
Rate method1 method3 method2 method6 method5 method4
method1 122763/s -- -90% -90% -90% -90% -92%
method3 1252858/s 921% -- -0% -1% -2% -23%
method2 1258330/s 925% 0% -- -1% -1% -22%
method6 1265165/s 931% 1% 1% -- -1% -22%
method5 1274309/s 938% 2% 1% 1% -- -21%
method4 1622943/s 1222% 30% 29% 28% 27% --
Running case 4: $M = 10000, $N = 20000
Rate method1 method3 method2 method6 method5 method4
method1 90687/s -- -62% -62% -93% -93% -94%
method3 236835/s 161% -- -1% -81% -81% -86%
method2 238334/s 163% 1% -- -81% -81% -85%
method6 1236025/s 1263% 422% 419% -- -3% -25%
method5 1270943/s 1301% 437% 433% 3% -- -22%
method4 1638548/s 1707% 592% 588% 33% 29% --
Running case 5: $M = 5, $N = 40000
Quantifier in {,} bigger than 32766 in regex; marked by <-- HERE in m/.{ <-- HERE 40000}(G.{5}G)/ at ./p.pl line 83.
Skipping method 'method4'..
Quantifier in {,} bigger than 32766 in regex; marked by <-- HERE in m/.{ <-- HERE 40000}(G.{5}G)/ at ./p.pl line 63.
Skipping method 'method1'..
Rate method2 method3 method6 method5
method2 1253528/s -- -1% -1% -2%
method3 1260746/s 1% -- -0% -1%
method6 1263378/s 1% 0% -- -1%
method5 1278718/s 2% 1% 1% --
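I ran this on my laptop, an Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz running Ubuntu 16.10. The results show that using the s modifier speeds up method1 considerably (compare with method4). But neither method1 nor method4 can be used when $N exceeds 32766.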
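Apart from speed, the s modifier also changes what the pattern matches: without it, . does not match a newline. That matters if the buffer can contain newline bytes between the two G characters. A minimal illustration on a made-up buffer:

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical buffer: five characters between the two G's, one of them a newline.
my $buf = "G" . "ab\ncd" . "G";

my $without_s = $buf =~ /G.{5}G/  ? 1 : 0;   # . stops at the newline
my $with_s    = $buf =~ /G.{5}G/s ? 1 : 0;   # . matches any character

say "without /s: ", $without_s ? "match" : "no match";
say "with /s:    ", $with_s    ? "match" : "no match";
```

So the methods with and without s in the benchmark are only interchangeable on data that contains no newlines, such as the test strings used there.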