Question

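I cobbled together a Perl script that takes every word from a batch of documents, eliminates all stop words, stems the remaining words, and builds a hash containing each stem and its frequency of occurrence. However, after a few minutes of running, I get an "Out of memory!" message in the command window. Is there a more efficient way to achieve the intended result, or do I just need to find a way to access more memory?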

#!/usr/bin/perl
use strict;
use warnings;
use Lingua::EN::StopWords qw(%StopWords);
use Lingua::Stem qw(stem);
use Mojo::DOM;

my $path = "U:/Perl/risk disclosures/2006-28";
chdir($path) or die "Cant chdir to $path $!";

# This program counts the total number of unique sentences in a 10-K and enumerates the frequency     of each one.

my @sequence;
my %sequences;
my $fh;

# Opening each file and reading its contents.
for my $file (<*.htm>) {
    my $data = do {
        open my $fh, '<', $file;
        local $/;    # Slurp mode
        <$fh>;
    };
    my $dom  = Mojo::DOM->new($data);
    my $text = $dom->all_text();
    for ( split /\s+/, $text ) {
        # Here eliminating stop words.
        while ( !$StopWords{$_} ) {
            # Here retaining only the word stem.
            my $stemmed_word = stem($_);
            ++$sequences{"$stemmed_word"};
        }
    }
}

Answer 1

If a word is not in %StopWords, you enter an infinite loop:

while ( !$StopWords{$_} ) {
    my $stemmed_word = stem($_);
    ++$sequences{"$stemmed_word"};

    # %StopWords hasn't changed, so $_ is still not in it
}

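There is no reason to use a loop here at all. You are already using a for loop to examine the words one at a time. A word either is a stop word or it isn't, so you only need to check once.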

I would do something more like this:

my $dom  = Mojo::DOM->new($data);
my @words = split ' ', $dom->all_text();

foreach my $word (@words) {
    next if defined $StopWords{$word};

    my $stemmed_word = stem $word;
    ++$sequences{$stemmed_word};
}

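In addition to replacing the inner while loop with

next if defined $StopWords{$word};

I also:

Removed the intermediate $text variable, since it looks like you only care about the individual words, not the whole block of text.
Added an explicit loop variable to the for loop. Various functions change $_ implicitly, so to avoid unintended side effects I use an explicit loop variable for everything except the most trivial loops, such as say for @array;
Removed the unnecessary quotes from ++$sequences{"$stemmed_word"};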
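
For anyone who wants to try the single-check loop without installing anything, here is a minimal, self-contained sketch. Everything in it is a stand-in I made up for illustration: %stop_words is a tiny hypothetical stop list (not Lingua::EN::StopWords), and toy_stem is a toy function that merely strips a trailing "s" (not a real stemming algorithm). One thing worth verifying with the real module: if I read the Lingua::Stem documentation correctly, its stem function returns an array reference rather than a plain string, so you would probably need to dereference the result before using it as a hash key.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stop list (assumption): stands in for %StopWords.
my %stop_words = map { $_ => 1 } qw(the a of and is);

# Toy stemmer (assumption): lowercases and strips one trailing "s".
# It only stands in for a real stemmer so the sketch runs with core Perl.
sub toy_stem {
    my ($word) = @_;
    (my $stem = lc $word) =~ s/s\z//;
    return $stem;
}

# Count stem frequencies, testing each word against the stop list exactly once.
sub count_stems {
    my ($text) = @_;
    my %count;
    for my $word ( split ' ', $text ) {
        next if $stop_words{ lc $word };    # skip stop words, no inner loop
        ++$count{ toy_stem($word) };
    }
    return \%count;
}

my $counts = count_stems('the risks of the risk and the markets');
printf "%s: %d\n", $_, $counts->{$_} for sort keys %$counts;
```

With the sample sentence this prints "market: 1" and "risk: 2". Each word is looked at once, so the loop always terminates.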
