Question

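I have text files from which I need to remove stop words. I keep the stop words in a text file. I load the "stop-word" text file into my Perl script and store the stop words in an array called "stops".

Currently I load a set of different text files, store them in a separate array, and then do pattern matching to see whether any of the words are in fact stop words. I can print the stop words and know which ones occur in a file, but how do I remove them from the text file and store a new text file that contains no stop words?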

i.e. stop words: the a to of and into

Text file: "The girl is driving and crashing into a man"

Resulting file: girl is driving crashing man

I load the files with:

$dirtoget = "/Users/j/temp/";
opendir( IMD, $dirtoget ) || die("Cannot open directory");
@thefiles = readdir(IMD);

foreach $f (@thefiles) {
if ( $f =~ m/\.txt$/ ) {

    open( FILE, "/Users/j/temp/$f" ) or die "Cannot open FILE";

    while (<FILE>) {
        @file = <FILE>;

This is the pattern-matching loop:

  foreach $word(split) {
                foreach $x (@stop) {
                   if  ($x =~ m/\b\Q$word\E\b/) {
                 $word='';
                        print $word,"\n";

This sets $word to an empty string.

Or could I do:

    $word = '' if exists $stops{$word};

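I'm just not sure how to set up the output file so that it no longer contains the matched words. Is it silly to store the unmatched words in an array and write them out to a file?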

Answer 1

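You can overwrite the file in place, but that is a hassle. The Unix way is to write the non-stop words to standard output (which is where print goes by default) and redirect it: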

./remove_stopwords.pl textfile.txt > withoutstopwords.txt

and then carry on with the file withoutstopwords.txt. This also lets the program be used in a pipeline.
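A minimal sketch of such a filter, under a few assumptions that are not in the question: the stop words sit one per line in a file passed as the first argument, words are separated by whitespace, and matching ignores case. The helper names strip_stop_words and load_stop_words are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the line with every word found in %$stops removed.
# Assumes whitespace-separated words; lookup is case-insensitive.
sub strip_stop_words {
    my ( $line, $stops ) = @_;
    return join ' ', grep { !$stops->{ lc $_ } } split ' ', $line;
}

# Read a stop-word list (assumed format: one word per line) into a hash ref.
sub load_stop_words {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    chomp( my @words = <$fh> );
    close $fh;
    return { map { ( lc($_) => 1 ) } grep { length } @words };
}

# First argument: stop-word file. Remaining arguments (or STDIN): the text.
# Does nothing when called without arguments.
if (@ARGV) {
    my $stops = load_stop_words( shift @ARGV );
    while ( my $line = <> ) {
        print strip_stop_words( $line, $stops ), "\n";
    }
}
```

It could be run as ./remove_stopwords.pl stopwords.txt textfile.txt > withoutstopwords.txt, where stopwords.txt is a hypothetical file name. Because words are compared whole, a word with punctuation attached (such as "man.") would not match the stop word "man"; handling that is left out of the sketch.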

Answer 2

Shorter:

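use strict;
use warnings;
use English qw<$LIST_SEPARATOR $NR>;

# @stop is assumed to already hold the stop words, loaded as in the question.
# Build one regex of the form \b(\Qword1\E|\Qword2\E|...)\b from it.
my $stop_regex 
    = do { 
        local $LIST_SEPARATOR = '\\E|\\Q';
        eval "qr/\\b(\\Q@{stop}\\E)\\b/";
    };
@ARGV = glob( '/Users/j/temp/*.txt' );
while ( <> ) { 
    next unless m/$stop_regex/;
    print "Stop word '$1' found at $ARGV line $NR\n";
}
continue {
    close ARGV if eof;    # reset $NR so line numbers restart in each file
}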

What do you want to do with these words? If you want to replace them, you can do this:

use English qw<$INPLACE_EDIT $LIST_SEPARATOR $NR>;
local $INPLACE_EDIT = '.bak';    # backup extension; setting it turns on in-place editing

...
while ( <> ) { 
    if ( m/$stop_regex/ ) {
        s/$stop_regex/$something_else/g;
    }
    print;
}

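When $INPLACE_EDIT is set, perl keeps the original contents of each input file as a backup copy with that extension (here '.bak') and sends whatever you print to a replacement for the original file, so by the time it moves on to the next file the original has been rewritten. If that's what you want to do.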
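If the goal is to delete the stop words rather than replace them, the same mechanism can be wrapped in a small routine. This is only a sketch: remove_in_place is a hypothetical helper, it uses the short variable name $^I (the same variable as $INPLACE_EDIT), and it builds the regex with join and quotemeta instead of the string eval shown earlier.

```perl
use strict;
use warnings;

# Delete every stop word from the named files in place, keeping a backup
# copy of each original under the given extension (for example '.bak').
sub remove_in_place {
    my ( $stop_words, $backup_ext, @files ) = @_;
    return unless @files;    # with an empty @ARGV, <> would read STDIN

    # One alternation of all stop words, each escaped so it matches literally.
    my $alternation = join '|', map { quotemeta($_) } @$stop_words;
    my $stop_regex  = qr/\b(?:$alternation)\b/;

    # Localize the in-place setting and the file list to this call only.
    local ( $^I, @ARGV ) = ( $backup_ext, @files );
    while ( my $line = <> ) {
        $line =~ s/$stop_regex//g;
        print $line;         # written to the rewritten file, not the terminal
    }
    return;
}
```

A call such as remove_in_place( \@stop, '.bak', glob('/Users/j/temp/*.txt') ) would rewrite each file and leave the untouched text in a matching .bak file. The blanks around each removed word stay in the line.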

Answer 3

You can use the substitution operator to remove words from a file:

use warnings;
use strict;

my @stop = qw(foo bar);
while (<DATA>) {
    my $line = $_;
    $line =~ s/\b$_\b//g for @stop;
    print $line;
}

__DATA__
here i am
with a foo
and a bar too
lots of foo foo food

Prints:

here i am
with a
and a  too
lots of   food
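The output above keeps the blanks that surrounded each removed word. A possible variant that tidies them up is sketched here; remove_words is a hypothetical helper name, and collapsing the leftover whitespace is an assumption about the desired result, not something this answer states. It also wraps each stop word in \Q...\E so that any regex metacharacters in a word are matched literally.

```perl
use strict;
use warnings;

# Remove each stop word from the text, then clean up the blanks left behind.
sub remove_words {
    my ( $text, @stop ) = @_;
    $text =~ s/\b\Q$_\E\b//g for @stop;    # delete whole-word matches only
    $text =~ s/[ \t]{2,}/ /g;              # collapse runs of blanks to one
    $text =~ s/[ \t]+$//mg;                # drop blanks at the end of a line
    return $text;
}
```

With the stop words foo and bar, the line "and a bar too" becomes "and a too", and "food" is left alone because \b only matches at word boundaries.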

Remove stop words and save a new file
