Question

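I'm working with a 16GB file and a small file.

I tried loading both files into memory. Then I walk through every line of the big file and check it against the contents of the small file (for each line of the big file, I iterate over the whole small file).

Here is my code: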

local $/ = undef;
open my $fh1, '<', $in or die "error opening $in: $!";
my $input_file = do { local $/; <$fh1> };

local $/ = undef;
open my $fh2, '<', $handle or die "error opening $handle: $!";
my $handle_file = do { local $/; <$fh2> };

my $counter_yes = 0;
my $counter_no  = 0;
my $flag        = 0;

my @lines1 = split /\n/, $input_file;

foreach my $line( @lines1 ) {

    my @f = split('\t', $line); # $f[0] and $f[1]
    print "f0 and f1 are: $f[0] and $f[1]\n";
    my @lines2 = split /\n/, $handle_file;

    foreach my $input ( @lines2 ){

        #print "line2 is: $input\n";
        my @sp = split /:/, $input; # $sp[0] and $sp[1]

        if ( $sp[0] eq $f[0] ){

            my @r = split /-/, $sp[1];

            if ( ($f[1] >= $r[0]) && ($f[1] <= $r[1]) ){
                $flag = 1;
                $counter_yes = $counter_yes;
                last;
            }
        }
    }

    if ( $flag == 0 ){
        $counter_no = $counter_no  ;
    }
}

When I run it, I get this error:

Split loop at script.pl line 30, <$fh2> chunk 1

What could be the reason?

Answer 1

You can run perldoc perldiag to find out what the built-in errors and warnings mean.

   Split loop
       (P) The split was looping infinitely.  (Obviously, a split
       shouldn't iterate more times than there are characters of input,
       which is what happened.)  See "split" in perlfunc.

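The string you're splitting is so huge that Perl thinks it is iterating infinitely. Perl assumes it is in an infinite loop once it has split the string more times than the length of the string + 10. Unfortunately, it stores that number as a 32-bit integer, which can only hold a little over 2 billion. Your string is over 16 billion characters long, so the results will be unpredictable.

This was recently fixed in 5.20, along with many other related problems in handling strings larger than 2G. So if you upgrade Perl, your code will "work".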
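To make the size mismatch concrete, here is a small arithmetic sketch. It assumes the 16GB file is slurped into a single string of roughly 16 * 2**30 bytes; the helper name overflows_int32 is made up for illustration and is not part of Perl.

```perl
use strict;
use warnings;

# Largest value a signed 32-bit integer can hold.
my $int32_max = 2**31 - 1;          # 2147483647

# Approximate length in bytes of a 16GB file slurped into one string (assumption).
my $string_len = 16 * 2**30;        # 17179869184

# Hypothetical helper: true when a length cannot fit in a signed 32-bit counter.
sub overflows_int32 {
    my ($len) = @_;
    return $len > $int32_max;
}

printf "string length: %.0f\n", $string_len;
printf "int32 maximum: %.0f\n", $int32_max;
print  "too large for a 32-bit counter\n" if overflows_int32($string_len);
```

The string is about eight times longer than the largest value such a counter can represent.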

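However, your code is extremely inefficient and will exhaust the memory of most machines, making it crawl as it swaps to disk. At minimum, you should slurp in only the small file and read the 16 gig file line by line.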

my @small_data = <$small_fh>;
chomp @small_data;

while( my $big = <$big_fh> ) {
    chomp $big;

    for my $small (@small_data) {
        ...
    }
}
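A self-contained version of the loop above might look like the following sketch. The sample data and the in-memory filehandles are assumptions made so that it runs without any files on disk; with real data you would open the two files by name instead.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the real files.
my $small_text = "chr1:100-200\nchr2:50-60\n";
my $big_text   = "chr1\t150\nchr2\t70\nchr3\t10\n";

# In-memory filehandles so the sketch needs no files on disk.
open my $small_fh, '<', \$small_text or die "cannot open small data: $!";
open my $big_fh,   '<', \$big_text   or die "cannot open big data: $!";

# Slurp only the small file into an array of lines.
chomp( my @small_data = <$small_fh> );

my $pairs_checked = 0;
while ( my $big = <$big_fh> ) {     # one line of the big file at a time
    chomp $big;
    for my $small (@small_data) {
        $pairs_checked++;           # the real comparison would go here
    }
}

print "compared $pairs_checked line pairs\n";
```

Only one line of the big file is held in memory at any time, but the inner loop still runs once per pair of lines.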

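But even this is very inefficient: the loop body runs once for every pair of lines, so if your small file contains 1000 lines, that loop could run up to 16 trillion times!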

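Since you appear to be checking whether an entry in the big file is in the small file, you'd be better off turning the entries of the small file into a hash table.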

my %fields;
while( my $line = <$small_fh> ) {
    chomp $line;
    my @sp = split /:/, $line;
    $fields{$sp[0]} = $sp[1];
}

Now you can iterate through the big file and just do a hash lookup.

while( my $line = <$big_fh> ) {
    chomp $line;
    my @f = split('\t', $line);

    if( defined $fields{$f[0]} ) {
        ...
    }
}
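One caveat with the hash above is that it keeps only the last value seen for each key. The sketch below is one possible extension, not the answer's own code. It assumes, based on the splits in the question, that small-file lines look like key:low-high and big-file lines look like key<TAB>position, and that a key may appear with several ranges. The names load_ranges and count_hits and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Build a lookup of key => list of [low, high] ranges from "key:low-high" lines.
sub load_ranges {
    my ($fh) = @_;
    my %ranges;
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $key, $range ) = split /:/, $line;
        next unless defined $range;             # skip lines without a range
        my ( $low, $high ) = split /-/, $range;
        push @{ $ranges{$key} }, [ $low, $high ];
    }
    return \%ranges;
}

# Count "key<TAB>position" lines that fall inside or outside the known ranges.
sub count_hits {
    my ( $fh, $ranges ) = @_;
    my ( $yes, $no ) = ( 0, 0 );
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $key, $pos ) = split /\t/, $line;
        my $hit = 0;                            # reset for every line
        for my $r ( @{ $ranges->{$key} || [] } ) {
            if ( $pos >= $r->[0] && $pos <= $r->[1] ) { $hit = 1; last }
        }
        $hit ? $yes++ : $no++;
    }
    return ( $yes, $no );
}

# Hypothetical sample data read through in-memory filehandles.
my $small_text = "chr1:100-200\nchr1:300-400\nchr2:50-60\n";
my $big_text   = "chr1\t350\nchr2\t70\nchr3\t10\n";
open my $small_fh, '<', \$small_text or die "cannot open small data: $!";
open my $big_fh,   '<', \$big_text   or die "cannot open big data: $!";

my ( $yes, $no ) = count_hits( $big_fh, load_ranges($small_fh) );
print "inside a range: $yes, outside: $no\n";
```

Each line of the big file now costs one hash lookup plus a scan of only the ranges stored under its key, rather than a pass over the whole small file.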

Answer 2

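Why are you reading the entire file into one big string and then splitting it into an array of lines, when you could read it into an array of lines to begin with? And why are you doing that over and over again for the second file? You could just put

chomp(my @lines1 = <$fh1>);
chomp(my @lines2 = <$fh2>);

at the top of your program, and get rid of the then-unused $input_file and $handle_file and all the $/ nonsense. This may well be the source of the problem, since the error message suggests that the split is producing too many "fields".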
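As a minimal illustration of reading straight into an array of lines, the sketch below uses an in-memory filehandle and made-up contents so that it runs without a file on disk.

```perl
use strict;
use warnings;

my $text = "alpha\nbeta\ngamma\n";   # hypothetical file contents
open my $fh, '<', \$text or die "cannot open in-memory data: $!";

# In list context, <$fh> returns every remaining line; chomp strips each newline.
chomp( my @lines = <$fh> );

print scalar(@lines), " lines, first is '$lines[0]'\n";
```

This still holds every line in memory, so it suits the small file far better than the 16GB one.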

Answer 3

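I'm working with a 16GB file and a small file.

I tried loading both files into memory.

Do you have 16GB of memory? In fact, your code requires more than 32GB of memory.

Split loop at script.pl line 30, <$fh2> chunk 1

I can't reproduce that error. Perl errors are usually very descriptive, but this one isn't even comprehensible.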

Next, if you had this in your code:

my $x = 10;
#nothing changes $x
#in these
#lines
$x = 10;

what would be the purpose of the last line? Yet, you did this:

$/ = undef;
#Nothing changes $/
#in these lines
$/ = undef;
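For comparison, here is a sketch of how the do { local $/; ... } idiom already used in the question confines the change to $/ to a single block, so no separate assignment is needed around it. The helper name slurp and the sample text are hypothetical.

```perl
use strict;
use warnings;

my $text = "one\ntwo\nthree\n";      # hypothetical file contents

# Slurp a whole handle; `local $/` undefines the separator only inside the block.
sub slurp {
    my ($fh) = @_;
    return do { local $/; <$fh> };
}

open my $fh, '<', \$text or die "cannot open in-memory data: $!";
my $whole = slurp($fh);

# Outside the block $/ is "\n" again, so a fresh handle reads line by line.
open my $fh2, '<', \$text or die "cannot open in-memory data: $!";
my $first_line = <$fh2>;

print "slurped ", length($whole), " characters\n";
print "first line afterwards: $first_line";
```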

Next, all perl programs should start with the following lines:

<guess>

If you don't know, then you need to buy a beginning perl book.

Perl operations with large files
