Perl: removing stop words from multiple files

Time: 2012-11-11 15:09:38

Tags: perl file stop-words

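I have read many forum posts about how to remove stop words from files. My code already removes many other things, but I want it to remove stop words as well. This is as far as I have got, and I don't know what I am missing. Please advise.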

use Lingua::StopWords qw(getStopWords);
my $stopwords = getStopWords('en');

chdir("c:/perl/input");
@files = <*>;

foreach $file (@files) 
  {
    open (input, $file);

    while (<input>) 
      {
        open (output,">>c:/perl/normalized/".$file);
    chomp;
    #####What should I write here to remove the stop words#####
    $_ =~s/<[^>]*>//g;
    $_ =~ s/\s\.//g;
    $_ =~ s/[[:punct:]]\.//g;
    if($_ =~ m/(\w{4,})\./)
    {
    $_ =~ s/\.//g;
    }
    $_ =~ s/^\.//g;
    $_ =~ s/,/' '/g;
    $_ =~ s/\(||\)||\\||\/||-||\'//g;

    print output "$_\n";

      }
   }

close (input);
close (output);

2 answers:

Answer 0 (score: 2)

The stop words are the keys of %$stopwords whose value is 1, i.e.:

my @stopwords = grep { $stopwords->{$_} } (keys %$stopwords);

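The stop words may well be exactly the keys of %$stopwords, but according to the Lingua::StopWords documentation you also need to check the value associated with each key.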
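To make the shape of the data concrete, here is a minimal sketch of that filter. It assumes, as described above, that getStopWords returns a hash reference mapping words to a true value. So that it runs without the module installed, $stopwords below is a small hand-written stand-in, not the real list, and the entry with value 0 is there only to show what the grep drops.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the hash ref returned by getStopWords('en').
# "perl" has a false value purely to show what the filter removes.
my $stopwords = { the => 1, and => 1, of => 1, perl => 0 };

# Keep only the keys whose associated value is true; sort for stable output.
my @stopwords = sort grep { $stopwords->{$_} } keys %$stopwords;

print join(', ', @stopwords), "\n";
```

The sort is not needed for correctness; it only makes the printed order predictable, since hash keys come back in no particular order.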

Once you have the stop words, you can remove them with code like this:

# remove all occurrences of @stopwords from $_

for my $w (@stopwords) {
  s/\b\Q$w\E\b//ig;
}

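Note the use of \Q...\E to quote any regex metacharacters that might appear in a stop word. Stop words are unlikely to contain metacharacters, but quoting is good practice whenever you want a regex to match a literal string.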

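We also use \b to match word boundaries. This helps ensure that we don't match a stop word in the middle of another word. Hopefully this works for you; a lot depends on what your input text looks like, i.e. whether it contains punctuation characters and so on.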
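A small sketch of both points, using a made-up word list rather than the real one. The entry 'a.m' is hypothetical and is there only because it contains a metacharacter: without \Q...\E the dot would also match the "r" in "arm". The final whitespace cleanup is an addition for readability, not part of the loop above.

```perl
use strict;
use warnings;

# Hypothetical stop words, chosen for illustration only.
my @stopwords = ('the', 'a.m');

my $line = 'The arm of the other theme at 9 a.m today';

for my $w (@stopwords) {
    # \b keeps "the" from matching inside "other" or "theme";
    # \Q...\E makes the "." in "a.m" literal, so "arm" is left alone.
    $line =~ s/\b\Q$w\E\b//ig;
}

# Removing words leaves runs of spaces behind, so collapse and trim them.
$line =~ s/\s+/ /g;
$line =~ s/^\s+|\s+$//g;

print "$line\n";   # prints "arm of other theme at 9 today"
```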

Answer 1 (score: 0)

# Always use these in your Perl programs.
use strict;
use warnings;

use File::Basename qw(basename);
use Lingua::StopWords qw(getStopWords);

# It's often better to build scripts that take their input
# and output locations as command-line arguments rather than
# being hard-coded in the program.
my $input_dir   = shift @ARGV;
my $output_dir  = shift @ARGV;
my @input_files = glob "$input_dir/*";

# Convert the hash ref of stop words to a regular array.
# Also quote any regex characters in the stop words.
my @stop_words  = map quotemeta, keys %{getStopWords('en')};

for my $infile (@input_files){
    # Open both input and output files at the outset.
    # Your posted code reopened the output file for each line of input.
    my $fname   = basename $infile;
    my $outfile = "$output_dir/$fname";
    open(my $fh_in,  '<', $infile)  or die "$!: $infile";
    open(my $fh_out, '>', $outfile) or die "$!: $outfile";

    # Process the data: you need to iterate over all stop words
    # for each line of input.
    while (my $line = <$fh_in>){
        $line =~ s/\b$_\b//ig for @stop_words;
        print $fh_out $line;
    }

    # Close the files within the processing loop, not outside of it.
    close $fh_in;
    close $fh_out;
}
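As a possible variation on the inner loop, and not what the code above does, the stop words can be joined into one precompiled alternation so that each line is scanned once. Sorting longest-first also avoids an ordering hazard of the per-word loop: if the list holds both a word and a longer form of it, removing the short one first leaves a fragment behind. The list below is a hypothetical stand-in for keys %{getStopWords('en')}, and strip_stop_words is a name invented for this sketch.

```perl
use strict;
use warnings;

# Hypothetical stand-in for keys %{ getStopWords('en') }.
my @stop_words = ("i", "i'm", "the", "to");

# Longest first, so that "i'm" is tried before "i" at the same position.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) || $a cmp $b } @stop_words;

# Compile the pattern once instead of once per stop word per line.
my $stop_re = qr/\b(?:$alternation)\b/i;

# Hypothetical helper: returns the line with all stop words removed.
sub strip_stop_words {
    my ($line) = @_;
    $line =~ s/$stop_re//g;
    return $line;
}

print strip_stop_words("I'm going to the shop\n");
```

With the per-word loop, processing "i" before "i'm" would turn "I'm" into "'m". Here the whole contraction is removed. The surrounding spaces are left in place, as in the answer's code.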