Question

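I'm running into a problem with the following code under the latest version of Strawberry Perl for Windows. I want to read all the text files in a directory and process their contents. I don't currently see a way to process them line by line, because some of the changes I want to make span newlines. The processing mostly involves deleting large chunks of each file. In my sample code it is just one substitution, but ideally I would run several similar regexes, each removing something from the file.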

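I run this script on a large number of files (> 10,000), and it always crashes with an "Out of memory!" message on one particular file that is larger than 400 MB. The odd thing is that when I write a program that processes only that ONE file, the code works fine.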

The machine has 8 GB of RAM, so I don't think physical RAM is the problem.

I read through other posts about memory problems but didn't find anything that would help me achieve my goal.

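Can anyone suggest what I need to change to make the program work, i.e. make it more memory-efficient or sidestep the problem in some other way?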

use strict;
use warnings;
use Path::Iterator::Rule;
use utf8;

use open ':std', ':encoding(utf-8)';

my $doc_rule = Path::Iterator::Rule->new;
$doc_rule->name('*.txt'); # only process text files
$doc_rule->max_depth(3); # don't recurse deeper than 3 levels
my $doc_it = $doc_rule->iter("C:\\Temp\\"); # backslashes must be escaped inside double quotes
while ( my $file = $doc_it->() ) { # go through all documents found
    print "Stripping $file\n";

    # read in file
    open (FH, "<", $file) or die "Can't open $file for read: $!";
    my @lines;
    while (<FH>) { push (@lines, $_) }; # slurp entire file
    close FH or die "Cannot close $file: $!";

    my $lines = join("", @lines); # put entire file into one string

    $lines =~ s/<DOCUMENT>\n<TYPE>EX-.*?\n<\/DOCUMENT>//gs; #perform the processing

    # write out file
    open (FH, ">", $file) or die "Can't open $file for write: $!";
    print FH $lines; # dump entire file
    close FH or die "Cannot close $file: $!";
}

Answer 1

Process the file line by line:

while ( my $file = $doc_it->() ) { # go through all documents found
    print "Stripping $file\n";

    open (my $infh, "<", $file) or die "Can't open $file for read: $!";
    open (my $outfh, ">", $file . ".tmp") or die "Can't open $file.tmp for write: $!";

    while (<$infh>) {
       if ( /<DOCUMENT>/ ) {
           # append the next line to test for TYPE
           $_ .= ( <$infh> // '' ); # '' avoids a warning if <DOCUMENT> is the last line
           if (/<TYPE>EX-/) {
              # document type is excluded, now loop through 
              # $infh until the closing tag is found.
              while (<$infh>) { last if m|</DOCUMENT>|; }

              # jump back to the <$infh> loop to resume
              # processing on the next line after </DOCUMENT>
              next;
           }
           # if we've made it this far, the document was not excluded
           # fall through to print both lines
       }
       print $outfh $_;
    }

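    close $outfh or die "Cannot close $file.tmp: $!";
    close $infh or die "Cannot close $file: $!";
    # replace the original with the filtered copy
    unlink $file or die "Cannot remove $file: $!";
    rename $file.'.tmp', $file or die "Cannot rename $file.tmp to $file: $!";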
}

Answer 2

You are holding two complete copies of the file in memory at the same time: @lines and $lines. You might consider this instead:

open (my $FH, "<", $file) or die "Can't open $file for read: $!";
$FH->input_record_separator(undef); # slurp entire file
my $lines = <$FH>;
close $FH or die "Cannot close $file: $!";

On sufficiently old versions of Perl you may need to explicitly use IO::Handle.

Also note: I've switched from the bareword filehandle to a lexical filehandle. I assume you aren't striving for compatibility with Perl v4.
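The same single-copy slurp is often written by localizing `$/` instead of calling a method on the handle. As far as I understand the IO::Handle documentation, `input_record_separator` is not supported per handle and changes `$/` globally, so later line-by-line reads in the same program could be affected. A minimal sketch, where `slurp_file` is a hypothetical helper name and not part of the original code:

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into a single string.
sub slurp_file {
    my ($path) = @_;
    open(my $fh, "<", $path) or die "Can't open $path for read: $!";
    # local undefines $/ only inside the do block; the previous value
    # is restored on exit, so other readline loops still read lines.
    my $content = do { local $/; <$fh> };
    close $fh or die "Cannot close $path: $!";
    return $content;
}
```

Usage would be `my $lines = slurp_file($file);` inside the iterator loop. This still holds the entire file in memory once.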

Of course, if halving the memory requirement isn't enough, you can always iterate through the file...
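One way to iterate while still letting a regex span newlines is to read one document at a time rather than one line at a time, by setting `$/` to the closing tag. The sketch below assumes every `</DOCUMENT>` starts its own line, as the pattern in the question implies. `strip_ex_documents` is a hypothetical helper name, and both arguments are assumed to be already-opened filehandles:

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in to $out, dropping <TYPE>EX- documents.
sub strip_ex_documents {
    my ($in, $out) = @_;
    # Each read now returns text up to and including the next closing
    # tag, so a record holds at most one complete document.
    local $/ = "</DOCUMENT>";
    while ( my $record = <$in> ) {
        # Same substitution as in the question, applied per record.
        $record =~ s/<DOCUMENT>\n<TYPE>EX-.*?\n<\/DOCUMENT>//gs;
        print {$out} $record;
    }
    return;
}
```

Memory use is then bounded by the largest single document rather than by the whole file. Additional substitutions that stay within one document could be applied to `$record` in the same loop.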

Answer 3

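Processing XML with regexes is error-prone and inefficient, as the code that slurps the whole file into a string demonstrates. To handle XML you should use an XML parser. In particular, you want a SAX parser, which works through the XML a piece at a time, as opposed to a DOM parser, which reads the whole file.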

I'll answer your question as asked, because there is some value in knowing how to work line by line.

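If you can avoid it, don't read the whole file into memory. Work line by line. Your task appears to be removing a handful of lines from an XML file for some reason: everything between <DOCUMENT>\n<TYPE>EX- and </DOCUMENT>. We can do that line by line by keeping a little bit of state.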

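use autodie;

open (my $infh, "<", $file);
open (my $outfh, ">", "$file.tmp");

my $document_line = ''; # a <DOCUMENT> line held back until the next line is seen
my $in_type_ex    = 0;  # true while inside a <TYPE>EX- document
while( my $line = <$infh> ) {
    if( $in_type_ex ) {
        # drop everything up to and including the closing tag
        $in_type_ex = 0 if $line =~ m{</DOCUMENT>}i;
        next;
    }
    if( $document_line ne '' ) {
        if( $line =~ m{^<TYPE>EX-}i ) {
            # excluded document: discard the held-back <DOCUMENT> line too
            $document_line = '';
            $in_type_ex    = 1;
            next;
        }
        # not excluded: emit the held-back line
        print $outfh $document_line;
        $document_line = '';
    }
    if( $line =~ m{<DOCUMENT>\n}i ) {
        $document_line = $line;
        next;
    }
    print $outfh $line;
}
print $outfh $document_line; # a trailing <DOCUMENT> with nothing after it

# close both handles before swapping the files
close $infh;
close $outfh;
rename "$file.tmp", $file;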

Using a temp file lets you read the original file while building its replacement.

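Of course, this will fail if the XML document isn't formatted just so (I helpfully added the /i flag to the regexes to allow for lowercase tags), and you really should use a SAX XML parser.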
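The temp-file pattern itself can be factored out so that each filter only has to decide what to do with one line. This is a rough sketch rather than a drop-in replacement: `rewrite_file` and its callback convention (return the line to keep it, `undef` to drop it) are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: stream $path through $filter line by line,
# writing to "$path.tmp" and swapping it in when done.
sub rewrite_file {
    my ($path, $filter) = @_;
    my $tmp = "$path.tmp";
    open(my $in,  "<", $path) or die "Can't open $path for read: $!";
    open(my $out, ">", $tmp)  or die "Can't open $tmp for write: $!";
    while ( my $line = <$in> ) {
        my $kept = $filter->($line);   # undef means "drop this line"
        print {$out} $kept if defined $kept;
    }
    # Close both handles before renaming; Windows generally refuses
    # to rename or replace a file that is still open.
    close $in  or die "Cannot close $path: $!";
    close $out or die "Cannot close $tmp: $!";
    rename $tmp, $path or die "Cannot rename $tmp to $path: $!";
    return;
}
```

A call might look like `rewrite_file($file, sub { $_[0] =~ /^#/ ? undef : $_[0] });`, which would drop lines starting with `#`. Only one line is held in memory at a time.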

Answer 4

While processing a somewhat large (1.2G) file with Perl 5.10.1 on Windows Server 2013, I noticed that

foreach my $line (<LOG>) {}

fails with an out-of-memory error, whereas

while (my $line = <LOG>) {}

works, in a simple script that just runs a few regexes and prints the lines I'm interested in.
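The difference comes from context. `foreach` evaluates `<LOG>` in list context, so every line of the file is read into a temporary list before the first iteration runs. `while` evaluates it in scalar context and reads one line per iteration. A small sketch of the scalar-context form, where `count_matches` is a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: count lines matching $pattern, one line at a time.
sub count_matches {
    my ($fh, $pattern) = @_;
    my $count = 0;
    # Scalar context: readline returns a single line per iteration,
    # so only the current line is held in memory.
    while ( my $line = <$fh> ) {
        $count++ if $line =~ $pattern;
    }
    return $count;
}
```

For example, `count_matches($log_fh, qr/ERROR/)` would scan a log of any size with roughly constant memory, assuming no single line is itself enormous.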

Perl "Out of memory" with large text files
