我有一个大的输入文件:
d0 NoS19
s0 This movie has been regarded as the cream of Hong Kong gangster and copmovie.
s1 And has won 22 awards.
s2 But we all know awards don't mean a thing sometimes.
我想删除通配符,停用词然后阻止输入。这对我来说很好。我的问题是如何从通配符,词干和删除词中删除标识符,如d0,NoS19,s0,s1,s2等。
我使用了porter stemmer,并且有一个包含大量停用词的文件,
对于我的wilcard删除,这就是我所做的:
$reviewContent =~ tr/A-Z/a-z/; #transfer upper case to lower case
$reviewContent =~ s/[a-z_0-9\.]*\@[a-z_0-9\.]*/ /g;
$reviewContent =~ s/[^a-zA-Z\']/ /g;
$reviewContent =~ s/ +\'/ /g;
$reviewContent =~ s/\' +/ /g;
$reviewContent =~ s/[^\w.-]/ /g;
$reviewContent =~ s/[ ]+/ /g;
$reviewContent =~ s/^\s+//g;
有什么想法吗?
答案 0 :(得分:1)
也许,首先将每一行拆分为代码并进行评论,然后仅对评论进行操作:
my ($code, $comment) = split ' ', $reviewContent, 2;
if ($code !~ /^d/) { # I asume the header always starts with a "d".
$comment =~ s/[a-z_0-9\.]*\@[a-z_0-9\.]*/ /g;
$comment =~ s/[^a-zA-Z\']/ /g;
$comment =~ s/ +\'/ /g;
$comment =~ s/\' +/ /g;
$comment =~ s/[^\w.-]/ /g;
$comment =~ s/[ ]+/ /g;
$comment =~ s/^\s+//g;
}
print "$code $comment\n";