How to preserve certain words during stop-word and wildcard removal in Perl

Time: 2015-01-19 16:18:37

Tags: perl

I have a large input file:

d0 NoS19
s0 This movie has been regarded as the cream of Hong Kong gangster and copmovie.
s1 And has won 22 awards.
s2 But we all know awards don't mean a thing sometimes.

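I want to remove wildcards and stop words, and then stem the input. That part works fine for me. My problem is how to keep identifiers such as d0, NoS19, s0, s1, s2, etc. from being touched by the wildcard removal, stemming, and stop-word removal.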

I use the Porter stemmer, and I have a file containing a large list of stop words.

For the wildcard removal, this is what I do:

$reviewContent =~ tr/A-Z/a-z/; #transfer upper case to lower case

$reviewContent =~ s/[a-z_0-9\.]*\@[a-z_0-9\.]*/ /g; 
$reviewContent =~ s/[^a-zA-Z\']/ /g; 
$reviewContent =~ s/ +\'/ /g; 
$reviewContent =~ s/\' +/ /g;
$reviewContent  =~ s/[^\w.-]/ /g; 
$reviewContent =~ s/[ ]+/ /g; 
$reviewContent =~ s/^\s+//g;    

Any ideas?

1 Answer:

Answer 0: (score: 1)

Perhaps split each line into the code and the comment first, and then operate only on the comment:

my ($code, $comment) = split ' ', $reviewContent, 2;
if ($code !~ /^d/) {         # I assume the header always starts with a "d".
    $comment =~ s/[a-z_0-9\.]*\@[a-z_0-9\.]*/ /g;
    $comment =~ s/[^a-zA-Z\']/ /g;
    $comment =~ s/ +\'/ /g;
    $comment =~ s/\' +/ /g;
    $comment =~ s/[^\w.-]/ /g;
    $comment =~ s/[ ]+/ /g;
    $comment =~ s/^\s+//g;
}
print "$code $comment\n";
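
The same idea can be extended to stop-word removal and stemming, so the identifier never passes through any of the three steps. The following is only a minimal sketch under some assumptions: the `%stop` list is a tiny hypothetical stand-in for the asker's stop-word file, `stem_word` is a placeholder that returns the word unchanged where the real Porter stemmer would be called, and `clean_line` is a name made up for this example. The cleanup is also simplified to "keep letters and in-word apostrophes" rather than reproducing every substitution from the question.

```perl
use strict;
use warnings;

# Hypothetical stop-word list; the question loads a much larger one from a file.
my %stop = map { $_ => 1 } qw(this has been as the of and but we all a);

# Placeholder for the Porter stemmer from the question (identity function here).
sub stem_word { my ($word) = @_; return $word }

# Clean one line while leaving its leading identifier untouched.
sub clean_line {
    my ($line) = @_;
    my ($code, $comment) = split ' ', $line, 2;
    return '' unless defined $code;           # blank line
    $comment = '' unless defined $comment;    # identifier with no text
    # Assumption carried over from the answer: header lines start with "d".
    return "$code $comment" if $code =~ /^d/;

    $comment = lc $comment;
    $comment =~ s/[^a-z']+/ /g;                   # keep only letters and apostrophes
    $comment =~ s/(?<![a-z])'|'(?![a-z])/ /g;     # drop apostrophes not inside a word
    my @words = grep { !$stop{$_} } split ' ', $comment;
    return join ' ', $code, map { stem_word($_) } @words;
}

print clean_line($_), "\n" for (
    "d0 NoS19",
    "s1 And has won 22 awards.",
);
```

With the sample stop list above, this prints `d0 NoS19` and `s1 won awards`. Because the identifier is split off before anything else happens, it does not matter whether `s1` or `NoS19` would otherwise look like a stop word or be altered by the stemmer.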