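I have read many forum posts on how to remove stop words from files. My code already strips out many other things, but I also want it to remove stop words. This is as far as I have got, and I don't know what I am missing. Please advise.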
use Lingua::StopWords qw(getStopWords);
my $stopwords = getStopWords('en');
chdir("c:/perl/input");
@files = <*>;
foreach $file (@files)
{
    open (input, $file);
    while (<input>)
    {
        open (output,">>c:/perl/normalized/".$file);
        chomp;
        #####What should I write here to remove the stop words#####
        $_ =~s/<[^>]*>//g;
        $_ =~ s/\s\.//g;
        $_ =~ s/[[:punct:]]\.//g;
        if($_ =~ m/(\w{4,})\./)
        {
            $_ =~ s/\.//g;
        }
        $_ =~ s/^\.//g;
        $_ =~ s/,/' '/g;
        $_ =~ s/\(||\)||\\||\/||-||\'//g;
        print output "$_\n";
    }
}
close (input);
close (output);
Answer 0 (score: 2)
The stop words are the keys of %$stopwords whose value is 1, i.e.:
@stopwords = grep { $stopwords->{$_} } (keys %$stopwords);
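The stop words may well turn out to be exactly the keys of %$stopwords, but according to the Lingua::StopWords documentation you should also check the value associated with each key.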
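To illustrate what the value check does, here is a minimal sketch. It does not call Lingua::StopWords at all: the hash ref below is a hypothetical stand-in for the result of getStopWords('en'), and the key with value 0 is invented purely to show the effect of the grep.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the hash ref returned by getStopWords('en').
# The 0-valued key is made up for illustration only.
my $stopwords = { the => 1, and => 1, of => 1, perl => 0 };

# Keep only the keys whose associated value is true.
my @stopwords = sort grep { $stopwords->{$_} } keys %$stopwords;

print join(', ', @stopwords), "\n";   # prints "and, of, the"
```

If every value in the real hash is true, the grep simply returns all of the keys, so the check costs nothing.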
Once you have the stop words, you can remove them with the following code:
# remove all occurrences of @stopwords from $_
for my $w (@stopwords) {
    s/\b\Q$w\E\b//ig;
}
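Note the use of \Q...\E to quote any regex metacharacters that might appear in a stop word. Stop words are unlikely to contain metacharacters, but it is good practice whenever you want to match a literal string inside a regex.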
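We also use \b to match word boundaries. This helps ensure that a stop word is not matched in the middle of another word. Hopefully this works for you, but a lot depends on what your input text looks like, i.e. whether it contains punctuation characters and so on.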
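The effect of \b can be seen in a small sketch. The three-word list and the sub name remove_stopwords are hypothetical, and the whitespace cleanup at the end is an assumed extra step that is not part of the answer's code.

```perl
use strict;
use warnings;

# Hypothetical short list; in practice it would come from getStopWords('en').
my @stopwords = ('the', 'is', 'a');

sub remove_stopwords {
    my ($text) = @_;
    for my $w (@stopwords) {
        # \b keeps "the" from matching inside "other" or "theme";
        # \Q...\E makes the stop word match as a literal string.
        $text =~ s/\b\Q$w\E\b//ig;
    }
    # Assumed cleanup step: collapse the gaps left by removed words.
    $text =~ s/\s+/ /g;
    $text =~ s/^\s+|\s+$//g;
    return $text;
}

print remove_stopwords("The other cat is a theme."), "\n";
# prints "other cat theme."
```

Without the \b anchors, the same substitution would also cut "the" out of "other" and "theme".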
Answer 1 (score: 0)
# Always use these in your Perl programs.
use strict;
use warnings;

use File::Basename qw(basename);
use Lingua::StopWords qw(getStopWords);

# It's often better to build scripts that take their input
# and output locations as command-line arguments rather than
# being hard-coded in the program.
my $input_dir   = shift @ARGV;
my $output_dir  = shift @ARGV;
my @input_files = glob "$input_dir/*";

# Convert the hash ref of stop words to a regular array.
# Also quote any regex characters in the stop words.
my @stop_words = map quotemeta, keys %{getStopWords('en')};

for my $infile (@input_files){
    # Open both input and output files at the outset.
    # Your posted code reopened the output file for each line of input.
    my $fname   = basename $infile;
    my $outfile = "$output_dir/$fname";
    open(my $fh_in,  '<', $infile)  or die "$!: $infile";
    open(my $fh_out, '>', $outfile) or die "$!: $outfile";

    # Process the data: you need to iterate over all stop words
    # for each line of input.
    while (my $line = <$fh_in>){
        $line =~ s/\b$_\b//ig for @stop_words;
        print $fh_out $line;
    }

    # Close the files within the processing loop, not outside of it.
    close $fh_in;
    close $fh_out;
}
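As a possible variation on the per-line loop above, the stop words can be joined into one alternation and compiled once with qr//, so each line needs only a single substitution. This is a sketch under assumptions, not part of the answer: the word list is a hypothetical stand-in for keys %{getStopWords('en')}, and strip_stop_words is an illustrative name.

```perl
use strict;
use warnings;

# Hypothetical stand-in for: keys %{ getStopWords('en') }
my @stop_words = ('the', 'a', 'is', "don't");

# Quote each word, join them into one alternation, and compile it once.
my $alternation = join '|', map { quotemeta } @stop_words;
my $stop_re     = qr/\b(?:$alternation)\b/i;

sub strip_stop_words {
    my ($line) = @_;
    $line =~ s/$stop_re//g;   # one pass instead of one pass per stop word
    return $line;
}

print strip_stop_words("The cat is a pet"), "\n";
```

The removed words leave their surrounding spaces behind, just as in the loop version, so a later whitespace-squeezing step may still be wanted.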