Perl: removing stop words from one file using a second file

Time: 2015-10-12 15:24:48

Tags: regex linux perl

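I am using a Perl script to remove all stop words from a text. The stop words are stored one per line. I am on the Mac OS X command line, and Perl is installed correctly.

The script does not work properly: it has a word-boundary problem.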

#!/usr/bin/env perl -w
# usage: script.pl words text >newfile
use English;

# poor man's argument handler
open(WORDS, shift @ARGV) || die "failed to open words file: $!";
open(REPLACE, shift @ARGV) || die "failed to open replacement file: $!";

my @words;
# get all words into an array
while ($_=<WORDS>) { 
  chop; # strip eol
  push @words, split; # break up words on line
}

# (optional)
# sort by length (makes sure smaller words don't trump bigger ones); ie, "then" vs "the"
@words=sort { length($b) <=> length($a) } @words;

# slurp text file into one variable.
undef $RS;
$text = <REPLACE>;

# now for each word, do a global search-and-replace; make sure only words are replaced; remove possible following space.
foreach $word (@words) { 
     $text =~ s/\b\Q$word\E\s?//sg;
}

# output "fixed" text
print $text;

sample.txt

$ cat sample.txt
how about i decide to look at it afterwards what
across do you think is it a good idea to go out and about i 
think id rather go up and above

stopWords.txt

I
a
about
an
are
as
at
be
by
com
for
from
how
in
is
it
..

Output:

$ ./remove.pl stopwords.txt sample.txt 
i decide look fterwards cross do you think good idea go out d i 
think id rather go up d bove

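As you can see, the stop word a is stripped from inside other words, so afterwards becomes fterwards. I think it is a regex problem. Could someone help me with a quick patch? Thanks for all the help :)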

2 answers:

Answer 0 (score: 1)

Use word boundaries on both sides of $word. Currently, you only check for one at the beginning.

With the second \b in place, you do not need the \s? condition:

$text =~ s/\b\Q$word\E\b//sg;
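
To try the two-sided boundary on a small string, a minimal sketch follows. The helper name strip_whole_words and the sample string are assumptions made for illustration, not part of the original script. Note that once \s? is dropped, the spaces around each removed word stay behind, so the sketch also shows one optional way to squeeze them with tr///s.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): remove each stop word
# only where it stands alone as a whole word.
sub strip_whole_words {
    my ($text, @words) = @_;
    for my $word (@words) {
        # \b on both sides: "a" no longer matches inside "afterwards" or "and".
        $text =~ s/\b\Q$word\E\b//g;
    }
    return $text;
}

my $out = strip_whole_words('look at it afterwards and about', qw(a about at it));
print "[$out]\n";    # the removed words leave their surrounding spaces behind

# Optional cleanup (an assumption, not in the answer): squeeze runs of spaces.
(my $tidy = $out) =~ tr/ //s;
print "[$tidy]\n";
```

Whether to keep \s? or clean up whitespace afterwards is a matter of taste; the boundary fix itself is the same either way.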

Answer 1 (score: 0)

Your regex is not strict enough.

$text =~ s/\b\Q$word\E\s?//sg;

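When $word is a, the command is effectively s/\ba\s?//sg. That means: remove every a that starts a word, together with at most one following whitespace character. In afterwards, this successfully matches the first a.

You can make the match stricter by ending the word with another \b, like this: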
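
The difference can be reproduced in isolation. The sketch below compares the pattern from the question with a version that has a boundary on both sides; the helper names strip_loose and strip_strict and the test string are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; neither name comes from the original post.

# The pattern from the question: a boundary only before the word.
sub strip_loose {
    my ($text, $word) = @_;
    $text =~ s/\b\Q$word\E\s?//g;
    return $text;
}

# The stricter pattern: a boundary on both sides of the word.
sub strip_strict {
    my ($text, $word) = @_;
    $text =~ s/\b\Q$word\E\b\s?//g;
    return $text;
}

print strip_loose('afterwards a bit', 'a'), "\n";   # prints "fterwards bit"
print strip_strict('afterwards a bit', 'a'), "\n";  # prints "afterwards bit"
```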

$text =~ s/\b\Q$word\E\b\s?//sg;
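
As a variation on the loop in the question, the whole stop list can be joined into one alternation and applied in a single pass. This is a sketch under stated assumptions, not the answerer's code: the helper name remove_stop_words is hypothetical, and matching stays case-sensitive as in the original script, so I in the stop list does not remove a lowercase i.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): build one regex from
# the stop list and remove every whole-word match in a single substitution.
sub remove_stop_words {
    my ($text, @words) = @_;
    return $text unless @words;

    # Escape each word and try longer words first, mirroring the sort
    # in the question's script.
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } @words;

    # \b on both sides keeps "a" from matching inside "afterwards";
    # \s? drops at most one following whitespace character.
    $text =~ s/\b(?:$alternation)\b\s?//g;
    return $text;
}

my @stop = qw(I a about an are as at be by com for from how in is it);
print remove_stop_words('how about i decide to look at it afterwards', @stop), "\n";
# prints "i decide to look afterwards"
```

If case-insensitive matching is wanted, adding the /i modifier to the substitution is one option; that changes the behaviour of the original script, so it is left out here.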