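I want to pattern-match against a file (about 200 MB) and store the matching lines in an array, together with an arbitrary number of lines before and after each match.
sub1, using Perl grep, takes 11 seconds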
sub3, using Unix egrep, takes 1 second
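sub6 (ack) takes 50 seconds (it would be faster without anchors such as \b and \s)
ack run from the command line takes 15 seconds
I'm interested in suggestions for speeding up sub1, or in a fast pure-Perl solution that does not depend on external tools.
It seems that Perl's grep is slower than Unix grep.
"index" really is faster than a regular expression (but I need \b, \s, etc.)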
http://www.perlmonks.org/?node_id=885174
http://www.perlmonks.org/?node_id=957554
Thanks
use 5.014;
use strict;
use warnings;
use Time::HiRes qw(usleep ualarm gettimeofday tv_interval);
use List::MoreUtils qw(uniq);
open FILE, '<', 'textMatchInAfile.txt' or die "cannot open textMatchInAfile.txt: $!";
my $p = '\bsala|che|relazione|di|questo|coso|^qui\$';
my $mR = 1; # number of extra rows to include before and after each match
my @n = <FILE>;
&sub1( $p, $mR, @n ); #suggest: insert references
&sub3( $p, $mR );
sub sub1 { # this sub uses Perl grep
my $p = $_[0]; #pattern
my $mR = $_[1]; #more rows
my @n = @_[ 2 .. $#_ ]; #input File
my $time = [gettimeofday];
my @new = grep { $n[$_] =~ /$p/ } 0 .. $#n;
my @unique =
map { @n[ $_ - $mR .. $_ + $mR ] } @new[ 0 + $mR .. $#new - $mR];
say "\n" . 'time sub1 perl grep: ' . tv_interval($time);
@unique = uniq(@unique);
say "sub 1 $#unique";
}
sub sub3 { # Unix grep with line numbers and context lines
my $p = $_[0];
my $mR = $_[1];
my $cmd = "grep -n -C $mR"; #with line numbers
$p =~ s/\|/ /g;
$p =~ s/\h+/" -e "/g;
$p = ' -e "' . $p . '" ';
say "cmd ===$cmd=== ss ===$p===";
my @values;
$values[0] = $p;
$values[1] = ( ' ' . 'textMatchInAfile.txt' );
my $time = [gettimeofday];
my @valori = `$cmd @values` or die "command '$cmd @values' failed: $?";
say 'sub3 egrep shell: ' . $#valori;
say 'time sub3, matches found with egrep shell: ' . tv_interval($time);
my @uniq_list = uniq(@valori);
}
sub sub6 { #perl ack
my $p = $_[0]; #pattern
my $mR = $_[1]; #more rows
my @values;
my $time = [gettimeofday];
my @valori = qx (ack -C $mR "$p" textMatchInAfile.txt)
or die "ack failed: $?";
say 'number of values found with ack: ' . scalar(@valori);
say 'time sub6 ack: ' . tv_interval($time);
}
#
#this one takes 11 seconds
use 5.014;
use warnings;
use Time::HiRes qw(usleep ualarm gettimeofday tv_interval);
my @array;
my $pattern = '\bsala|che|relazione|di|questo|coso|^qui\$';
open( my $filehandle, '<', 'textMatchInAfile.txt' ) or die "cannot open textMatchInAfile.txt: $!";
my $time = [gettimeofday];
while (<$filehandle>) {
if ( $_ =~ /$pattern/ ) {
push @array, $_; # store the matching line (the original push had no value to push)
}
}
say 'time while' . tv_interval($time);
OK, Unix grep is an order of magnitude faster than Perl grep; I'll accept that.
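The pure-Perl approach asked about above can be sketched as a single pass with a regex compiled once via qr//. This is only a minimal sketch, not a tuned solution: the helper name grep_with_context and the sample pattern and data are illustrative assumptions. It merges overlapping context windows by line index, which avoids negative array indexes near the start of the input and makes a later uniq() pass over line contents unnecessary.

```perl
use strict;
use warnings;

# Return the lines of @$lines that match $re, plus $ctx lines of context
# before and after each match. Windows that overlap are merged, so each
# line is returned at most once and in its original order.
# (grep_with_context is a hypothetical helper, not part of the question's code.)
sub grep_with_context {
    my ( $re, $ctx, $lines ) = @_;
    my @out;
    my $next = 0;    # index of the first line that has not been emitted yet
    for my $i ( 0 .. $#$lines ) {
        next unless $lines->[$i] =~ $re;
        my $from = $i - $ctx;
        $from = $next if $from < $next;       # never go below 0 or re-emit lines
        my $to = $i + $ctx;
        $to = $#$lines if $to > $#$lines;     # never run past the last line
        next if $from > $to;                  # whole window was already emitted
        push @out, @{$lines}[ $from .. $to ];
        $next = $to + 1;
    }
    return \@out;
}

# Usage with made-up sample data; the array is passed by reference to avoid a copy.
my @sample = map { "$_\n" } ( 'uno', 'la sala grande', 'due', 'tre', 'quattro', 'relazione finale' );
my $hits = grep_with_context( qr/\bsala\b|\brelazione\b/, 1, \@sample );
print @$hits;
```

Whether this closes the gap with egrep on a 200 MB file is not something the sketch demonstrates; it only removes avoidable work such as copying the input array and running uniq() over the results.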
Answer 0 (score: 3)
Why not use grep -B 1 -A 1?
It gives you exactly the output you need.
grep -B 1 -A 1 -E pattern file
Regards
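If the grep command above is to be called from a Perl script, one option is the list form of a piped open, which bypasses the shell so the pattern needs no quoting. The following is a minimal sketch under assumptions: the helper name grep_context_external is hypothetical, and it relies on a grep that supports -A and -B (GNU and BSD grep do).

```perl
use strict;
use warnings;

# Run grep with $ctx lines of context around each match and return its
# output lines. -n prefixes each line with its line number.
# (grep_context_external is a hypothetical helper name.)
sub grep_context_external {
    my ( $pattern, $ctx, $file ) = @_;
    open( my $fh, '-|', 'grep', '-n', '-E', '-B', $ctx, '-A', $ctx,
        '-e', $pattern, '--', $file )
        or die "cannot run grep: $!";
    my @lines = <$fh>;
    close $fh;    # grep exits with status 1 when nothing matched; not treated as fatal here
    return @lines;
}
```

A call such as grep_context_external('\bsala|che', 1, 'textMatchInAfile.txt') would return the matching lines with one line of context on each side.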
Answer 1 (score: 1)
I ran a basic comparison of Unix egrep and Perl grep, the latter in two different implementations.
use Benchmark qw(cmpthese);
my $count = $ARGV[0] || 100;
my $re = "L[aeiou]n*.?[xyz]\\b";
cmpthese($count, {
unix => sub {
my $result = `dmesg|egrep '$re'`;
#print "===unix===\n";
#print $result;
},
perl => sub {
my @result = grep {$_ =~ m/$re/} split m/\n/, `dmesg`;
#print "===perl===\n";
#map {print "$_\n"} @result;
},
perl2 => sub {
open(DMESG, "dmesg|" ) or die "cannot open dmesg pipe!";
my @result;
while(<DMESG>) {
push @result, $_ if m/$re/;
}
#print "===perl2===\n";
#map {print} @result;
close DMESG;
},
});
Results:
$ perl grep.pl 1000
        Rate  unix  perl perl2
unix  24.6/s    --  -40%  -44%
perl  41.0/s   67%    --   -6%
perl2 43.6/s   77%    6%    --
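So please explain why Perl's grep should inherently be slower than Unix grep.
PS: I adapted the script to run on a file containing 25k lines of random data, with a different RE. That setup is somewhat closer to yours.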
$ perl tmp/grep.pl 1000
        Rate  unix  perl perl2
unix  3.71/s    --  -32%  -44%
perl  5.50/s   48%    --  -17%
perl2 6.64/s   79%   21%    --
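The adapted script is not shown above, so the following is only a guess at what a file-based variant of the benchmark could look like. Everything in it is an assumption made for illustration: the helper names, the random-data generator, the 60-character line length, the pattern and the iteration count. Note that the grep timing includes the cost of starting a process on every iteration.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Temp qw(tempfile);

# Write $n lines of random letters and spaces to a temp file (hypothetical test data).
sub make_random_file {
    my ($n) = @_;
    my ( $fh, $name ) = tempfile( UNLINK => 1 );
    my @chars = ( 'a' .. 'z', 'A' .. 'Z', ' ' );
    for ( 1 .. $n ) {
        print {$fh} join( '', map { $chars[ rand @chars ] } 1 .. 60 ), "\n";
    }
    close $fh or die "close failed: $!";
    return $name;
}

# Count matching lines in Perl, reading line by line with a precompiled regex.
sub count_matches_perl {
    my ( $re, $file ) = @_;
    open( my $in, '<', $file ) or die "cannot open $file: $!";
    my $count = 0;
    while ( my $line = <$in> ) { $count++ if $line =~ $re }
    close $in;
    return $count;
}

# Count matching lines with grep -E -c, started without a shell.
sub count_matches_egrep {
    my ( $pattern, $file ) = @_;
    open( my $out, '-|', 'grep', '-c', '-E', '-e', $pattern, '--', $file )
        or die "cannot run grep: $!";
    my $count = <$out>;
    close $out;
    chomp $count;
    return $count;
}

my $file    = make_random_file(25_000);
my $pattern = 'L[aeiou]n*.?[xyz]';    # same shape as the RE above, without \b
my $re      = qr/$pattern/;

cmpthese( 20, {
    egrep => sub { count_matches_egrep( $pattern, $file ) },
    perl  => sub { count_matches_perl( $re, $file ) },
} );
```

Both subs count the same lines, so any difference in rate reflects the matching and I/O path plus, for grep, process startup; the numbers it prints will vary by machine and are not comparable to the tables above.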