Question

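$rvsfile is the path to a file of about 200 MB. I want to count the number of lines in it that match $userid, but calling grep inside a while loop seems very slow. Is there an efficient way to do this? Because $rvsfile is so large, I cannot read it into memory with @tmp = <FILEHANDLE>.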

while(defined($line = <SRCFILE>))
{
    $line =~ /^([^\t]*)\t/;
    $userid = $1;
    $linenum = `grep '^$userid\$' $rvsfile | wc -l`;
    chomp($linenum);
    print "$userid $linenum\n";
    if($linenum == 0)
    {
        print TARGETFILE "$line";
    }
}

How can I get the part of a line before the \t without using a regex? For example, the line might look like this:

2013123\tsomething

How can I get 2013123 without a regex?

Answer 1

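Yes, you are forking a shell on every loop iteration, which is slow. You are also reading the entire $rvsfile once for every user. That is far too much work.

Read through SRCFILE once and build @userids.
Then read through $rvsfile once, keeping a running count for each user ID as you go.

Sketch: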

my @userids;

while(<SRCFILE>)
{
    push @userids, $1 if /^([^\t]*)\t/;
}

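# Escape regex metacharacters in the IDs before building the alternation.
my $alternation = join '|', map { quotemeta } @userids;
my $regex = qr/^($alternation)$/;   # compile the pattern once
my %count;

while (<RSVFILE>)
{
    ++$count{$1} if $_ =~ $regex;
}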

# %count has everything you need...

Answer 2

You can use index to find the position of the first \t, which will be faster than a regex. You can then use substr to extract the part before it.
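
As a minimal sketch of that idea, the helper below (field_before_tab is a hypothetical name, not from the question) takes the text before the first tab using index and substr. It assumes that a line with no tab should be returned whole, which may not be what you want.

```perl
use strict;
use warnings;

# Return the text before the first tab, or the whole line if there is no tab.
# (field_before_tab is an illustrative name, not part of the original code.)
sub field_before_tab {
    my ($line) = @_;
    my $pos = index($line, "\t");    # index returns -1 when no tab is found
    return $pos >= 0 ? substr($line, 0, $pos) : $line;
}

print field_before_tab("2013123\tsomething"), "\n";    # prints 2013123
```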

I suggest you benchmark the various approaches.
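
One way to do that is with the core Benchmark module. The sketch below compares a regex capture against index plus substr on a single sample line. The sample line and the sub names are made up for illustration, and the timings will depend on your Perl build and your data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample line; real input may behave differently.
my $line = "2013123\tsomething else on the line\n";

# Extract the ID with a regex capture.
sub by_regex {
    my ($id) = $line =~ /^([^\t]*)\t/;
    return $id;
}

# Extract the ID with index and substr.
sub by_index {
    my $pos = index($line, "\t");
    return $pos >= 0 ? substr($line, 0, $pos) : undef;
}

# Run each candidate for at least 1 CPU second and print a comparison table.
cmpthese(-1, { regex => \&by_regex, index => \&by_index });
```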

Answer 3

If I read you correctly, you want something like this:

#!/usr/bin/perl

use strict;
use warnings;

my $userid = 1246;
my $count = 0;

my $rsvfile = 'sample';

open my $fh, '<', $rsvfile or die "Cannot open $rsvfile: $!";

while(<$fh>) {
  $count++ if /$userid/;
}

print "$count\n";

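Or even more compactly (note that <$fh> in list context reads every line into memory before grep counts the matches, so this only suits files that fit in memory):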

#!/usr/bin/perl

use strict;
use warnings;

my $userid = 1246;

my $rsvfile = 'sample';

open my $fh, '<', $rsvfile or die "Cannot open $rsvfile: $!";

my $count = grep {/$userid/} <$fh>;

print "$count\n";
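
Note that /$userid/ also matches lines that merely contain the ID (1246 would match 12467). The grep in the question anchors the pattern to the whole line, so a closer equivalent might compare the chomped line for equality. The sketch below assumes one ID per line in the large file and uses an in-memory filehandle as hypothetical sample data; count_exact is an illustrative name.

```perl
use strict;
use warnings;

# Count lines that are exactly $userid, reading one line at a time
# so the whole file is never held in memory.
sub count_exact {
    my ($fh, $userid) = @_;
    my $count = 0;
    while (my $line = <$fh>) {
        chomp $line;
        $count++ if $line eq $userid;
    }
    return $count;
}

# Hypothetical sample data standing in for the large file.
my $sample = "1246\n12467\n1246\n91246\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print count_exact($fh, 1246), "\n";    # prints 2
```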

Answer 4

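If <SRCFILE> is relatively small, you can do it the other way round. Read the larger file one line at a time and check each line against every user ID, keeping a count per user ID in a hash. Something like: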

my %userids = map {($_, 0)}                # use as hash key with init value of 0
              grep { length }              # keep non-empty IDs (including "0")
              map {/^([^\t]+)/} <SRCFILE>; # extract ID

while (defined(my $line = <LARGEFILE>)) {
    for (keys %userids) {
        ++$userids{$_} if $line =~ /\Q$_\E/; # \Q...\E escapes special chars in $_
    }
}

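This way only the smaller data set is iterated over repeatedly, and the large file is scanned just once. You end up with a hash keyed by user ID whose value is the number of lines that ID appears in.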

Answer 5

Use a hash:

my %count;
while (<LARGEFILE>) {
    chomp;
    $count{$_}++;
};
# now $count{userid} is the number of occurrences
# of $userid in LARGEFILE

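Or, if you are worried about the hash using too much memory (say you are interested in 6 users and the large file contains 100K more), do it the other way round: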

my %count;
while (<SMALLFILE>) {
    /^(.*?)\t/ and $count{$1} = 0;   # key on the captured ID, not the whole line
};

while (<LARGEFILE>) {
    chomp;
    $count{$_}++ if defined $count{$_};
};
# now $count{userid} is the number of occurrences
# of $userid in LARGEFILE, *if* userid is in SMALLFILE
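
Building on the second variant, here is a sketch of how the original goal (writing out the source lines whose ID never appears in the large file) might be completed. It is not the asker's code: unseen_lines and the sample data are hypothetical, in-memory filehandles stand in for the real files, and it assumes the smaller file fits comfortably in memory.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for SRCFILE and the large file.
my $src_text = "alice\tfoo\nbob\tbar\ncarol\tbaz\n";
my $rvs_text = "alice\nalice\ncarol\ndave\n";

# Return the source lines whose user ID never appears in the large file.
sub unseen_lines {
    my ($src, $rvs) = @_;
    my (%count, @records);

    # Pass 1: remember each source line and start its ID's count at 0.
    open my $src_fh, '<', \$src or die "Cannot open source data: $!";
    while (my $line = <$src_fh>) {
        next unless $line =~ /^([^\t]*)\t/;
        my $id = $1;
        $count{$id} = 0 unless exists $count{$id};
        push @records, [$id, $line];
    }
    close $src_fh;

    # Pass 2: scan the large file once, counting only the IDs we care about.
    open my $rvs_fh, '<', \$rvs or die "Cannot open large data: $!";
    while (my $id = <$rvs_fh>) {
        chomp $id;
        $count{$id}++ if exists $count{$id};
    }
    close $rvs_fh;

    return map { $_->[1] } grep { $count{ $_->[0] } == 0 } @records;
}

print unseen_lines($src_text, $rvs_text);    # prints the "bob" line
```

With real files, the two in-memory opens would be replaced by opens on the actual paths, and the returned lines printed to TARGETFILE.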

Answer 6

If you have the option, try awk:

awk 'FNR==NR{a[$1];next} { for(i in a) { if ($0 ~ i) { print $0} } } ' $SRCFILE $rsvfile

Is there a better way to "grep" a large file than using `grep` from within Perl?
