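Running regexes on a 50GB file in Perl

Date: 2015-04-08 11:55:33

Tags: regex perl

How can I stream a large 50GB file in Perl so that I can run some regexes on each line (or chunk)? I tried the plain vanilla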

for $line (<FH>) {
   # do regex
}

I also tried Tie::File and File::Stream, but Perl always tries to load the entire file into memory, which is simply not possible.

#!/usr/bin/perl

use IO::Handle;
use Tie::File;
use File::Stream;
#tie @array, 'Tie::File', $ARGV[0] or die "could not open file";

STDOUT->autoflush(1);

$file=$ARGV[0];
open(INFO, "< $file") or die("Could not open  file.");

print "opening ... \n";
my $stream = File::Stream->new(<INFO>);

#$out = $ARGV[1];
#open(my $OH, '>', $out) or die "Could not open file '$out' $!";
print "starting ... \n";
while (<$stream>)  {
    $line = $_;
    $line =~ s/\n/\[!BR!\]/g;
    $line =~ s/<page>/\n<page>/g;
    $line =~ s/<\/page>/<\/page>\n/g;
    print $line;

    #STDOUT->flush();
}

close(INFO);

2 answers:

Answer 0 (score: 9)

The correct "plain vanilla" syntax is

while (my $line = <FH>) { ...

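Your for loop does cause Perl to read the whole file into memory first: for evaluates <FH> in list context, which returns every line at once, whereas the while form reads one line per iteration.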
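As a minimal sketch of the while form applied to the substitutions from the question: the helper name process_stream and the in-memory sample data are assumptions made for illustration, not part of the original post. Only the current line is held in memory at any time.

```perl
use strict;
use warnings;

# Hypothetical helper: read $in one line at a time and write the
# transformed text to $out.
sub process_stream {
    my ($in, $out) = @_;
    while (my $line = <$in>) {    # scalar context: one line per iteration
        $line =~ s/\n/\[!BR!\]/g;
        $line =~ s/<page>/\n<page>/g;
        $line =~ s/<\/page>/<\/page>\n/g;
        print {$out} $line;
    }
    return;
}

# Demo on a small in-memory "file" so the sketch runs on its own.
my $input = "<page>a\nb</page>\n<page>c</page>\n";
open(my $in, '<', \$input) or die "Cannot open in-memory input: $!";
my $output = '';
open(my $out, '>', \$output) or die "Cannot open in-memory output: $!";
process_stream($in, $out);
close($out);
print $output;
```

For a real file, the same sub could be called with a handle from open(my $in, '<', $ARGV[0]) and \*STDOUT as the output handle.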

Answer 1 (score: 1)

I suggest using the approach listed on this PerlMonks page.

Here is the example from that page:

# Set the character which will be used to indicate the end of a line.
# This defaults to the system's end of line character, but it doesn't
# hurt to set it explicitly, just in case some other part of your code
# has altered it from the default.
local $/ = "\n";

# Open the file for read access:
open my $filehandle, '<', 'myfile.txt'
  or die "Could not open myfile.txt: $!";

my $line_number = 0;

# Loop through each line:
while (defined(my $line = <$filehandle>))
{
  # The text of the line, including the linebreak
  # is now in the variable $line.

  # Keep track of line numbers
  $line_number++;

  # Strip the linebreak character at the end.
  chomp $line;

  # Do something with the line.
  do_something($line);

  # Perhaps bail out of the loop
  if ($line =~ m/^ERROR/)
  {
    warn "Error on line $line_number - skipping rest of file";
    last;
  }
}

Edit: to get the line number, you can omit $line_number and use $. instead (see http://perldoc.perl.org/perlvar.html).
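A minimal sketch of the $. variant, assuming a hypothetical helper named first_error_line and made-up sample data; $. holds the current line number of the most recently read filehandle, so no manual counter is needed.

```perl
use strict;
use warnings;

# Hypothetical helper: return the line number of the first line that
# starts with "ERROR", or undef if no such line exists.
sub first_error_line {
    my ($fh) = @_;
    while (my $line = <$fh>) {
        chomp $line;
        return $. if $line =~ m/^ERROR/;   # $. is the current line number
    }
    return;
}

# Demo on a small in-memory "file" so the sketch runs on its own.
my $text = "ok\nok\nERROR disk full\nok\n";
open(my $fh, '<', \$text) or die "Cannot open in-memory input: $!";
my $n = first_error_line($fh);
print defined $n ? "Error on line $n\n" : "No error found\n";
```

Note that $. keeps counting across reads on the same handle and is reset when the handle is explicitly closed.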