Question

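I have a process that stores a file in an array. Unfortunately, when the file is too large (say 800K lines, or more than 60 MB), it fails with an error such as "Out of memory!". Is there any solution to this? For example, the following code throws "Out of memory!".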

#! /usr/bin/perl

die unless (open (INPUT, "input.txt"));
@file=<INPUT>;                     # It fails here 
print "File stored in array\n";    # It never reaches here
$idx=0;
while ($idx < @file) {
    $idx++;
}
print "The line count is = $idx\n";

Answer 1

I would use Tie::File:

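use strict;
use warnings;
use Tie::File;

my @file;

# Tie the array to the file: lines are fetched from disk on demand
# instead of being loaded into memory all at once.
tie @file, 'Tie::File', "input.txt"
    or die "Cannot tie input.txt: $!";

print "File reflected in array\n";
print "The line count is ", scalar(@file), "\n";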
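As a hedged variation on the snippet above, Tie::File also accepts a `mode` option and a `memory` option that limits its read cache. The sketch below opens the file read-only and fetches a single line by index. The helper name `nth_line` and the 1 MB cache size are assumptions made for illustration, not part of the original answer.

```perl
use strict;
use warnings;
use Fcntl 'O_RDONLY';
use Tie::File;

# Hypothetical helper: return line $n (0-based) of $path without slurping the file.
sub nth_line {
    my ($path, $n) = @_;

    tie my @lines, 'Tie::File', $path,
        mode   => O_RDONLY,      # open read-only so the file cannot be modified by accident
        memory => 1_000_000      # assumed cache limit of roughly 1 MB
        or die "Cannot tie $path: $!";

    my $line = $lines[$n];       # read from disk on demand; the line ending is stripped
    untie @lines;
    return $line;
}
```

Something like `nth_line("input.txt", 0)` would return the first line. Note that asking for `scalar(@lines)` still makes Tie::File scan the whole file to locate every line, so counting lines this way is not free even though the contents are not held in memory.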

Answer 2

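Most of the time, you don't need to read the whole file in at once. When called in scalar context, the readline operator returns only one line at a time: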

use feature 'say';   # say is not enabled by default

1 while <INPUT>;   # read a line, and discard it.
say "The line count is = $.";

The special variable $. holds the current line number of the filehandle that was read most recently.
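The same idea can be wrapped in a small function that uses a lexical filehandle and checks that `open` succeeded. This is only a sketch: `count_lines` is a hypothetical name, and it keeps an explicit counter instead of `$.` so the result does not depend on which filehandle was read last.

```perl
use strict;
use warnings;

# Hypothetical helper: count the lines of $path while holding one line at a time.
sub count_lines {
    my ($path) = @_;

    open my $in, '<', $path or die "Cannot open $path: $!";

    my $count = 0;
    while (my $line = <$in>) {   # scalar context: one line per iteration
        $count++;
    }

    close $in;
    return $count;
}
```

Memory use stays roughly constant regardless of file size, because each line is discarded before the next one is read.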

Edit: so the line count was only an example.

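Perl has no problem with large arrays; it rather seems that your system does not have enough memory. Note that Perl arrays use more memory than C arrays, because each scalar allocates extra memory for flags and so on, and because arrays grow in increasing steps.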

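If memory is a problem, you have to transform your algorithm from one that must load the whole file into memory into one that keeps only one line at a time.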
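To illustrate that transformation with a task other than counting, here is a minimal sketch that finds the longest line of a file. Instead of loading every line and then comparing them, it remembers only the best candidate seen so far. The function name `longest_line` is an assumption chosen for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the longest line of $path (without its newline),
# or undef for an empty file. Only the current line and the best one are kept.
sub longest_line {
    my ($path) = @_;

    open my $in, '<', $path or die "Cannot open $path: $!";

    my $best;
    while (my $line = <$in>) {
        chomp $line;
        # Replace the candidate only when a strictly longer line turns up.
        if (!defined $best || length($line) > length($best)) {
            $best = $line;
        }
    }

    close $in;
    return $best;
}
```

Many per-line tasks (filtering, summing a column, searching) fit this shape; tasks that need to compare every line against every other line, such as sorting, need a different approach.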

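Example: sorting a file of a gigabyte or more. The normal approach, print sort <$file>, won't work here. Instead, we sort portions of the file, write them to temporary files, and then switch between the temporary files in a clever way to produce one sorted output: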

use strict; use warnings; use autodie;

my $blocksize = shift @ARGV; # take lines per tempfile as command line arg

mkdir "/tmp/$$";  # $$ is the process ID variable

my $tempcounter = 0;
my @buffer;
my $save_buffer = sub {
    $tempcounter++;
    open my $tempfile, ">", "/tmp/$$/$tempcounter";
    print $tempfile sort @buffer;
    @buffer = ();
};
while (<>) {
  push @buffer, $_;
  $save_buffer->() if $. % $blocksize == 0;
}
$save_buffer->();

# open all files, read 1st line
my @head =
  grep { defined $_->[0] }
  map { open my $fh, "<", $_; [scalar(<$fh>), $fh] }
  glob "/tmp/$$/*";

# sort the line-file pairs, pick least
while((my $least, @head) = sort { $a->[0] cmp $b->[0] } @head){
  my ($line, $fh) = @$least; print $line;

  # read next line
  if (defined($line = <$fh>)){
    push @head, [$line, $fh];
  }           
}

# clean up afterwards
END {
  unlink $_ for glob "/tmp/$$/*";
  rmdir "/tmp/$$";
}

It can be invoked like $ ./sort-large-file 10000 multi-gig-file.txt >sorted.txt.

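This general approach can be applied to all kinds of problems. It is a "divide and conquer" strategy: if the problem is too big, solve smaller problems, and then combine the partial results.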
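The "combine" half of the strategy can also be sketched on its own. Merging two already-sorted inputs needs only one line per input in memory at any time. The function below is a two-way simplification of the merge loop in the script above, written for illustration; `merge_sorted` is a hypothetical name, and it assumes both inputs are sorted in the same string order that `cmp` uses.

```perl
use strict;
use warnings;

# Hypothetical helper: merge two sorted input handles into one sorted output handle.
sub merge_sorted {
    my ($in_a, $in_b, $out) = @_;

    my $line_a = <$in_a>;
    my $line_b = <$in_b>;

    # While both inputs have a pending line, emit the smaller one and refill it.
    while (defined $line_a && defined $line_b) {
        if ($line_a le $line_b) {
            print {$out} $line_a;
            $line_a = <$in_a>;
        }
        else {
            print {$out} $line_b;
            $line_b = <$in_b>;
        }
    }

    # One input is exhausted; copy whatever remains of the other.
    while (defined $line_a) { print {$out} $line_a; $line_a = <$in_a>; }
    while (defined $line_b) { print {$out} $line_b; $line_b = <$in_b>; }

    return;
}
```

Applied repeatedly to pairs of sorted temporary files, a merge like this yields one fully sorted file without ever holding more than two lines in memory.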

Can't store a large file in an array?
