grep for lines matching a pattern, plus the lines before and after each match

Asked: 2012-03-06 12:08:26

Tags: perl grep

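I want to pattern-match against a file (about 200 MB) and store in an array the matching lines, together with an arbitrary number of lines before and after each matching line.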

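sub1, using Perl's grep, takes 11 seconds

sub2, using Unix egrep, takes 1 second

sub6 (ack) takes 50 seconds (it would be faster without anchors such as \b, \s, etc.)

ack run from the command line takes 15 seconds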

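I am interested in suggestions for speeding up sub1, or in a fast pure-Perl solution that does not depend on external tools.

It seems that Perl's grep is slower than the Unix one.

"index" really is faster than a regular expression (but I need \b, \s, etc.)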
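One way to combine the two, sketched here for a single fixed word only: let index() reject lines that cannot match, and run the regex (with its \b anchors) only on the remaining candidates. The helper name match_word and the sample lines are hypothetical, and whether this is actually faster depends on the data, so it should be benchmarked before relying on it.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the lines that contain $word as a whole word.
# index() is a plain substring test; the regex then confirms the word
# boundaries, but only for lines that passed the substring test.
sub match_word {
    my ( $word, @lines ) = @_;
    my $re = qr/\b\Q$word\E\b/;    # compile once, quote metacharacters
    return grep { index( $_, $word ) >= 0 && $_ =~ $re } @lines;
}

# Made-up sample data: "insalata" contains "sala" but not as a whole word.
my @lines = ( "la sala grande\n", "insalata mista\n", "niente qui\n" );
my @hits  = match_word( 'sala', @lines );
print @hits;
```

This does not extend directly to an alternation of several words; each alternative would need its own substring test.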

http://www.perlmonks.org/?node_id=885174

http://www.perlmonks.org/?node_id=957554

Thanks

use 5.014;
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);
use List::MoreUtils qw(uniq);

open my $fh, '<', 'textMatchInAfile.txt'
  or die "cannot open textMatchInAfile.txt: $!";
my $p = '\bsala|che|relazione|di|questo|coso|^qui\$';
my $mR = 1;        # number of rows to keep before and after each match
my @n  = <$fh>;
close $fh;

sub1( $p, $mR, \@n );    # pass the lines by reference to avoid copying them
sub3( $p, $mR );

sub sub1 {               # this sub uses Perl's grep
my ( $p, $mR, $n ) = @_;      # pattern, context rows, reference to the input lines
my $re   = qr/$p/;            # compile the pattern once
my $time = [gettimeofday];
my @hits = grep { $n->[$_] =~ $re } 0 .. $#$n;    # indices of matching lines
# expand each hit to its context window, clamped to the array bounds, and
# deduplicate by line index so that overlapping windows are not repeated
my @idx = uniq map {
    my $lo = $_ - $mR < 0    ? 0    : $_ - $mR;
    my $hi = $_ + $mR > $#$n ? $#$n : $_ + $mR;
    $lo .. $hi;
} @hits;
my @unique = @$n[@idx];
say "\n" . 'time sub1 perl grep: ' . tv_interval($time);
say "sub 1 $#unique";
}

sub sub3 {    # Unix grep with line numbers and context
my ( $p, $mR ) = @_;           # pattern, context rows
my $cmd = "grep -n -C $mR";    # -n: line numbers, -C: context rows
# turn each alternative of the pattern into its own -e "..." option
$p =~ s/\|/ /g;
$p =~ s/\h+/" -e "/g;
$p = ' -e "' . $p . '" ';
say "cmd ===$cmd=== ss ===$p===";
my @values = ( $p, ' textMatchInAfile.txt' );
my $time = [gettimeofday];
my @valori = `$cmd @values` or die "command '$cmd @values' failed: $?";
say 'sub3 egrep shell: ' . $#valori;
say 'time sub3 (matches found with egrep via shell): ' . tv_interval($time);
my @uniq_list = uniq(@valori);
}

sub sub6 {             # ack
my ( $p, $mR ) = @_;    # pattern, context rows
my $time   = [gettimeofday];
my @valori = qx(ack -C $mR "$p" textMatchInAfile.txt)
  or die "ack failed: $?";
say 'number of values found with ack: ' . $#valori;
say 'time sub6 ack: ' . tv_interval($time);
}
#this one takes 11 seconds

 use 5.014;
 use warnings;
 use Time::HiRes qw(gettimeofday tv_interval);

 my @array;
 my $pattern = '\bsala|che|relazione|di|questo|coso|^qui\$';
 open( my $filehandle, '<', 'textMatchInAfile.txt' )
     or die "cannot open textMatchInAfile.txt: $!";
 my $time = [gettimeofday];
 while (<$filehandle>) {
     if ( $_ =~ /$pattern/ ) {
         push @array, $_;    # store the matching line
     }
 }
 say 'time while: ' . tv_interval($time);

OK, Unix grep is an order of magnitude faster than Perl's grep; I will accept that.
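For anyone who still wants to stay in pure Perl, here is a minimal single-pass sketch that collects the context lines without slurping the whole file. The helper name grep_context and the in-memory sample are hypothetical. It reads the handle once, keeps at most $ctx pending "before" lines, and never emits a line twice. It has not been timed against the 200 MB file, so no speed claim is made.

```perl
use strict;
use warnings;

# Hypothetical helper: return the matching lines plus $ctx lines of context
# before and after each match, reading the filehandle exactly once.
sub grep_context {
    my ( $fh, $re, $ctx ) = @_;
    my ( @out, @before );    # @before: most recent lines not yet emitted
    my $after = 0;           # context lines still owed after the last match
    while ( my $line = <$fh> ) {
        if ( $line =~ $re ) {
            push @out, @before, $line;    # pending context, then the match
            @before = ();
            $after  = $ctx;
        }
        elsif ( $after > 0 ) {
            push @out, $line;             # trailing context of a previous match
            $after--;
        }
        elsif ( $ctx > 0 ) {
            push @before, $line;          # remember as possible leading context
            shift @before if @before > $ctx;
        }
    }
    return @out;
}

# Usage on made-up data, via an in-memory filehandle
my $text = join '', map { "line $_\n" } 1 .. 10;
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
my @lines = grep_context( $fh, qr/\bline (?:3|8)\b/, 1 );
close $fh;
print @lines;
```

Because each line is either emitted immediately or held in @before, overlapping context windows need no separate deduplication step.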

2 answers:

Answer 0: (score: 3)

Why not use grep -B 1 -A 1?

That gives you exactly the output you need.

grep -B 1 -A 1 -E pattern file

Regards
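If the result has to end up in a Perl array, that command can be read through a pipe. The sketch below assumes a grep that supports -A and -B (GNU and BSD grep do). The helper name grep_with_context and the temporary sample file are made up for illustration. The list form of open bypasses the shell, so the pattern needs no shell quoting.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: run grep with $ctx lines of context on each side
# and return its output lines.
sub grep_with_context {
    my ( $pattern, $ctx, $file ) = @_;
    open( my $grep, '-|', 'grep', '-E', '-B', $ctx, '-A', $ctx,
        '-e', $pattern, '--', $file )
        or die "cannot run grep: $!";
    # grep prints "--" between non-adjacent groups; drop those separators
    # (a real line consisting only of "--" would be dropped as well)
    my @lines = grep { $_ ne "--\n" } <$grep>;
    close $grep;    # exit status 1 only means "no match"; not fatal here
    return @lines;
}

# Usage on a made-up temporary file
my ( $tmp, $path ) = tempfile( UNLINK => 1 );
print {$tmp} map { "line $_\n" } 1 .. 10;
close $tmp;
my @lines = grep_with_context( 'line (3|8)$', 1, $path );
print @lines;
```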

Answer 1: (score: 1)

I ran a basic comparison of Unix egrep against Perl's grep, the latter in two different implementations.

use strict;
use warnings;
use Benchmark qw(cmpthese);

my $count = $ARGV[0] || 100;    # iterations per variant

my $re = "L[aeiou]n*.?[xyz]\\b";

cmpthese($count, {
    # shell pipeline: dmesg piped into egrep
    unix => sub {
        my $result = `dmesg|egrep '$re'`;

        #print "===unix===\n";
        #print $result;
    },
    # slurp the dmesg output, split it into lines, filter with Perl's grep
    perl => sub {
        my @result = grep { $_ =~ m/$re/ } split m/\n/, `dmesg`;

        #print "===perl===\n";
        #map {print "$_\n"} @result;
    },
    # read the dmesg output line by line from a pipe
    perl2 => sub {
        open( my $dmesg, '-|', 'dmesg' ) or die "cannot open dmesg pipe: $!";

        my @result;

        while (<$dmesg>) {
            push @result, $_ if m/$re/;
        }

        #print "===perl2===\n";
        #map {print} @result;

        close $dmesg;
    },
});

Results:

$ perl grep.pl 1000
        Rate  unix  perl perl2
unix  24.6/s    --  -40%  -44%
perl  41.0/s   67%    --   -6%
perl2 43.6/s   77%    6%    --

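So please explain why Perl's grep should inherently be slower than Unix grep.

P.S. I adapted the script to run on a file containing 25k lines of random data, with a different RE. That case is somewhat closer to yours.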

$ perl tmp/grep.pl 1000
        Rate  unix  perl perl2
unix  3.71/s    --  -32%  -44%
perl  5.50/s   48%    --  -17%
perl2 6.64/s   79%   21%    --