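I'm using Perl and Mojo::DOM to process a large number of text files. I need to count the occurrences of all words that end with certain suffixes.
Running this code keeps returning an "Out of memory" error for batches of more than 40 files.
Is there a way to do this more efficiently (with less memory) than what I'm doing below?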
#!/software/perl512/bin/perl
use strict;
use warnings;
use autodie;
use Mojo::DOM;
my $path = "/data/10K/2012";
chdir($path) or die "Cant chdir to $path $!";
# This program counts the total number of suffixes of a form in a given document.
my @sequence;
my %sequences;
my $file;
my $fh;
my @output;
# Reading in the data.
for my $file (<*.txt>) {
    my %affixes;
    my %word_count;
    my $data = do {
        open my $fh, '<', $file;
        local $/; # Slurp mode
        <$fh>;
    };
    my $dom = Mojo::DOM->new($data);
    my $text = $dom->all_text();
    for (split /\s+/, $text) {
        if ($_ =~ /[a-zA-Z]+(ness|ship|dom|ance|ence|age|cy|tion|hood|ism|ment|ure|tude|ery|ity|ial)\b/ ) {
            ++$affixes{"affix_count"};
        }
        ++$word_count{"word_count"};
    }
    my $output = join ",", $file, $affixes{"affix_count"}, $word_count{"word_count"};
    push @output, ($output);
}
@output = sort @output;
open(my $fh3, '>', '/home/usr16/rcazier/PerlCode/affix_count.txt');
foreach (@output) {
    print $fh3 "$_\n ";
}
close $fh3;
Answer 0 (score: 1)
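This is as close as I can get to a solution. It incorporates all the points made in the comments, and it solves the "Out of memory" error by leaving any HTML tags intact: the files are read line by line instead of being loaded whole into Mojo::DOM. It also leaves the results unsorted, because the original code didn't really do any useful sorting.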
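Because of the way you are looking for suffixed words, I think it is very unlikely that leaving the HTML tags in the text files will significantly affect your results.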
#!/software/perl512/bin/perl
use strict;
use warnings;
use 5.010;
use autodie;
# Build and compile a regex that will match any of the suffixes that interest
# us, for later use in testing each "word" in the input file
#
my $suffix_re = do {
    my @suffixes = qw/ ness ship dom ance ence age cy tion hood ism ment ure tude ery ity ial /;
    my $alternation = join '|', @suffixes;
    qr/ (?: $alternation ) /xi;
};
# Set the directory that we want to examine. `autodie` will check the success
# of `chdir` for us
#
my $path = '/data/10K/2012';
chdir $path;
# Process every file with a `txt` file type
#
for my $filename ( grep -f, glob('*.txt') ) {
    warn qq{Processing "$filename"\n};
    open my ($fh), '<', $filename;
    my ($suffixes, $word_count) = (0, 0);
    while (<$fh>) {
        for (split) {
            ++$word_count;
            ++$suffixes if /\A[a-z]+$suffix_re\z/i;
        }
    }
    say join ',', $filename, $suffixes, $word_count if $suffixes;
}
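
To see why leftover HTML tags rarely matter here, this is a minimal sketch that applies the same anchored test to a single line. The sample sentence and the `@matched` array are invented for illustration and are not part of the original data or script. A token that contains markup is still counted as a word, but it cannot satisfy the anchored pattern, so it is never counted as a suffixed word.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same suffix list and regex construction as in the script above
my @suffixes    = qw/ ness ship dom ance ence age cy tion hood ism ment ure tude ery ity ial /;
my $alternation = join '|', @suffixes;
my $suffix_re   = qr/ (?: $alternation ) /xi;

# Hypothetical sample line (an assumption, not taken from the real files)
my $line = 'The <b>management</b> showed kindness and leadership in its duty.';

my ($suffixes, $word_count) = (0, 0);
my @matched;    # illustrative only: remembers which tokens matched

for my $token ( split ' ', $line ) {
    ++$word_count;

    # The whole token must be letters followed by one of the suffixes
    if ( $token =~ /\A[a-z]+$suffix_re\z/i ) {
        ++$suffixes;
        push @matched, $token;
    }
}

# prints: words=9 suffixed=2 matched=kindness leadership
print "words=$word_count suffixed=$suffixes matched=@matched\n";
```

Note that `<b>management</b>` is counted as a word but not as a suffixed word. The same anchoring also means that a word followed directly by punctuation, such as `kindness.`, would not be counted. If that matters for your data, stripping punctuation from each token before the test is one option.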