Perl "Out of memory" error when processing files

Date: 2014-09-25 18:08:37

Tags: perl

I'm using Perl and Mojo::DOM to process a large number of text files. I need to count the occurrences of all words that end with certain suffixes.

Running this code keeps returning an "Out of memory" error for batches of more than 40 files.

Is there a way to do this more efficiently (using less memory) than what I'm doing below?

#!/software/perl512/bin/perl

use strict;
use warnings;
use autodie;

use Mojo::DOM;

my $path = "/data/10K/2012";
chdir($path) or die "Cant chdir to $path $!";

# This program counts the total number of suffixes of a form in a given document.

my @sequence;
my %sequences;
my $file;
my $fh;
my @output;

# Reading in the data.
for my $file (<*.txt>) {

   my %affixes;
   my %word_count;

   my $data = do {
      open my $fh, '<', $file;
      local $/;    # Slurp mode
      <$fh>;
   };

   my $dom  = Mojo::DOM->new($data);
   my $text = $dom->all_text();

   for (split /\s+/, $text) {
      if ($_ =~ /[a-zA-Z]+(ness|ship|dom|ance|ence|age|cy|tion|hood|ism|ment|ure|tude|ery|ity|ial)\b/ ) {
         ++$affixes{"affix_count"};
      }
      ++$word_count{"word_count"};
   }

   my $output = join ",", $file, $affixes{"affix_count"}, $word_count{"word_count"};

   push @output, ($output);
}

@output = sort @output;

open(my $fh3, '>', '/home/usr16/rcazier/PerlCode/affix_count.txt');
foreach (@output) {
   print $fh3 "$_\n ";
}
close $fh3;

1 Answer:

Answer 0 (score: 1)

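This is as close as I can get to a solution. It incorporates all the points made in the comments, and it solves the "Out of memory" error by leaving any HTML tags in place (it does not use Mojo::DOM at all). It also leaves the results unsorted, because the original code doesn't really do any useful sorting.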

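Because of the way you are looking for suffixed words, I think leaving the HTML tags in your text files is very unlikely to affect your results significantly.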
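A quick way to see how leftover tags interact with the anchored pattern used in the script below is to run it over one line of markup. This is only a sketch: the sample line is made up, and `@matched` is a helper added for illustration. It suggests that tag fragments themselves do not match, though a word glued to a tag is missed and every token still counts toward the word total.

```perl
use strict;
use warnings;

# Same suffix list and anchored test as the answer's script.
my $suffix_re = do {
   my @suffixes    = qw/ ness ship dom ance ence age cy tion hood ism ment ure tude ery ity ial /;
   my $alternation = join '|', @suffixes;
   qr/ (?: $alternation ) /xi;
};

# Hypothetical sample line; not taken from the real data.
my $line = '<div class="section"><p>The management made a statement</p></div>';

my ($suffixes, $word_count) = (0, 0);
my @matched;    # illustration only: remember which tokens matched

for my $token (split ' ', $line) {
   ++$word_count;
   if ($token =~ /\A[a-z]+$suffix_re\z/i) {
      ++$suffixes;
      push @matched, $token;
   }
}

# Prints: tokens=6 suffixed=1 matched=management
print "tokens=$word_count suffixed=$suffixes matched=@matched\n";
```

Here `statement</p></div>` is a single whitespace-delimited token, so it fails the `\z` anchor and is not counted as a suffixed word, while `<div` and `class="section"><p>The` add to the word count without matching.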

#!/software/perl512/bin/perl

use strict;
use warnings;
use 5.010;
use autodie;

# Build and compile a regex that will match any of the suffixes that interest
# us, for later use in testing each "word" in the input file
#
my $suffix_re = do {
   my @suffixes  = qw/ ness ship dom ance ence age cy tion hood ism ment ure tude ery ity ial /;
   my $alternation = join '|', @suffixes;
   qr/ (?: $alternation ) /xi;
};

# Set the directory that we want to examine. `autodie` will check the success
# of `chdir` for us
#
my $path = '/data/10K/2012';
chdir $path;

# Process every file with a `txt` file type
#
for my $filename ( grep -f, glob('*.txt') ) {

   warn qq{Processing "$filename"\n};

   open my ($fh), '<', $filename;

   my ($suffixes, $word_count) = (0, 0);

   while (<$fh>) {
      for (split) {
         ++$word_count;
         ++$suffixes if /\A[a-z]+$suffix_re\z/i;
      }
   }

   say join ',', $filename, $suffixes, $word_count if $suffixes;
}
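
To try the counting loop without touching the real data directory, the same logic can be wrapped in a small subroutine and fed an in-memory filehandle. This is a minimal sketch under assumptions: `count_suffixes` and the sample text are hypothetical names and data introduced here, not part of the original answer.

```perl
use strict;
use warnings;

my $suffix_re = qr/ (?: ness|ship|dom|ance|ence|age|cy|tion|hood|ism|ment|ure|tude|ery|ity|ial ) /xi;

# Hypothetical helper: count words and suffixed words from any open
# filehandle, holding only one line in memory at a time.
sub count_suffixes {
   my ($fh) = @_;
   my ($suffixes, $word_count) = (0, 0);

   while (my $line = <$fh>) {
      for my $word (split ' ', $line) {
         ++$word_count;
         ++$suffixes if $word =~ /\A[a-z]+$suffix_re\z/i;
      }
   }

   return ($suffixes, $word_count);
}

# Made-up sample text, read through an in-memory filehandle.
my $sample = "The government praised the partnership.\nFreedom and kindness matter\n";
open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
my ($suffixes, $word_count) = count_suffixes($fh);
close $fh;

# Prints: 3,9
print "$suffixes,$word_count\n";
```

Note that `partnership.` is not counted, because the trailing full stop stops the `\z` anchor from matching. Whether to strip punctuation before testing each word is a design choice the original answer does not address.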