How can I stream a large 50 GB file in Perl so that I can run some regular expressions on each line (or chunk)? I tried the plain vanilla loop:
for $line (<FH>) {
# do regex
}
I also tried Tie::File and File::Stream, but Perl always tries to load the entire file into memory, which is simply not possible.
#!/usr/bin/perl
use IO::Handle;
use Tie::File;
use File::Stream;
#tie @array, 'Tie::File', $ARGV[0] or die "could not open file";
STDOUT->autoflush(1);
$file=$ARGV[0];
open(INFO, "< $file") or die("Could not open file.");
print "opening ... \n";
my $stream = File::Stream->new(<INFO>);
#$out = $ARGV[1];
#open(my $OH, '>', $out) or die "Could not open file '$out' $!";
print "starting ... \n";
while (<$stream>) {
$line = $_;
$line =~ s/\n/\[!BR!\]/g;
$line =~ s/<page>/\n<page>/g;
$line =~ s/<\/page>/<\/page>\n/g;
print $line;
#STDOUT->flush();
}
close(INFO);
Answer 0 (score: 9)
The correct "plain vanilla" syntax is
while (my $line = <FH>) { ...
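Your for loop does indeed cause Perl to read the whole file into memory first: for evaluates <FH> in list context, so every line is read before the first iteration runs, whereas while reads one line per iteration.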
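As a minimal sketch, the substitutions from the question could be combined with this loop as follows. The helper name process_stream is hypothetical, and the sketch assumes the input path is passed as the first command-line argument, as in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: applies the question's substitutions line by line.
# Reads from $in and writes to $out, holding only one line in memory.
sub process_stream {
    my ($in, $out) = @_;
    # Scalar-context readline: fetches a single line per iteration.
    while (my $line = <$in>) {
        $line =~ s/\n/\[!BR!\]/g;
        $line =~ s/<page>/\n<page>/g;
        $line =~ s/<\/page>/<\/page>\n/g;
        print {$out} $line;
    }
    return;
}

# Run only when a file name is given on the command line.
if (@ARGV) {
    my $file = $ARGV[0];
    open(my $fh, '<', $file) or die "Could not open '$file': $!";
    process_stream($fh, \*STDOUT);
    close($fh);
}
```

Because the loop never builds a list of lines, memory use should stay roughly proportional to the longest line rather than to the file size.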
Answer 1 (score: 1)
I recommend the approach listed on this PerlMonks page.
Here is the example from that page:
# Set the character which will be used to indicate the end of a line.
# This defaults to the system's end of line character, but it doesn't
# hurt to set it explicitly, just in case some other part of your code
# has altered it from the default.
local $/ = "\n";
# Open the file for read access:
open my $filehandle, '<', 'myfile.txt' or die "Could not open myfile.txt: $!";
my $line_number = 0;
# Loop through each line:
while (defined(my $line = <$filehandle>))
{
# The text of the line, including the linebreak
# is now in the variable $line.
# Keep track of line numbers
$line_number++;
# Strip the linebreak character at the end.
chomp $line;
# Do something with the line.
do_something($line);
# Perhaps bail out of the loop
if ($line =~ m/^ERROR/)
{
warn "Error on line $line_number - skipping rest of file";
last;
}
}
Edit: to get the line number, you can omit $line_number and use $. instead (see http://perldoc.perl.org/perlvar.html).
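A short sketch of the same loop relying on $. rather than a manual counter might look like this. The helper name first_error_line is hypothetical, and the file name is assumed to come from the command line.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the number of the first line that starts
# with "ERROR", or undef if no such line exists.
sub first_error_line {
    my ($fh) = @_;
    while (defined(my $line = <$fh>)) {
        chomp $line;
        # $. holds the current line number of the last-read filehandle.
        return $. if $line =~ m/^ERROR/;
    }
    return;
}

# Run only when a file name is given on the command line.
if (@ARGV) {
    open(my $fh, '<', $ARGV[0]) or die "Could not open '$ARGV[0]': $!";
    my $n = first_error_line($fh);
    warn "Error on line $n - skipping rest of file\n" if defined $n;
    close($fh);
}
```

Note that $. refers to the most recently read filehandle, so it is worth capturing its value before reading from any other handle.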