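I am implementing a naive Bayes classification algorithm. My training set consists of many abstracts, each in a separate file. I want to use n-grams to get term-frequency weights, but my code does not handle multiple files.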
I edited my code, and now the error I get is
Can't call method "tscore" on an undefined value
To check this, I printed @ngrams, and it showed garbage-looking values such as hash0*29G45 or something similar.
#!c:\perl\bin\perl.exe -w
use warnings;
use Algorithm::NaiveBayes;
use Lingua::EN::Splitter qw(words);
use Lingua::StopWords qw(getStopWords);
use Lingua::Stem;
use Algorithm::NaiveBayes;
use Lingua::EN::Ngram;
use Data::Dumper;
use Text::Ngram;
use PPI::Tokenizer;
use Text::English;
use Text::TFIDF;
use File::Slurp;
my $pos_file = 'D:\aminoacids';
my $neg_file = 'D:\others';
my $test_file = 'D:\testfiles';
my @vectors = ();
my $categorizer = Algorithm::NaiveBayes->new;
my @files = <$pos_file/*>;
my @ngrams;
for my $filename (@files) {
open(FH, $filename);
my $ngram = Lingua::EN::Ngram->new($filename);
my $tscore = $ngram->tscore;
foreach (sort { $$tscore{$b} <=> $$tscore{$a} } keys %$tscore) {
print "$$tscore{ $_ }\t" . "$_\n";
}
my $trigrams = $ngram->ngram(2);
foreach my $trigram (sort { $$trigrams{$b} <=> $$trigrams{$a} } keys %$trigrams) {
print $$trigrams{$trigram}, "\t$trigram\n";
}
my %positive;
$positive{$_}++ for @files;
$categorizer->add_instance(
attributes => \%positive,
label => 'positive'
);
}
close FH;
Answer 0 (score: 1)
Your code <$pos_file/*> should work fine (thanks @borodir). I am leaving the following alternative in place anyway, so as not to mess up the edit history.
Try
opendir (DIR, $directory) or die $!;
and then
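while (my $filename = readdir(DIR)) {
    next if $filename =~ /^\.\.?$/;        # skip the '.' and '..' entries
    my $path = "$directory/$filename";     # readdir returns bare names, so add the directory
    next unless -f $path;                  # only plain files
    open ( my $fh, '<', $path ) or die "Cannot open $path: $!";
    # work with filehandle
    close $fh;
}
closedir DIR;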
If it is called in list context, readdir should give you the list of files:
my @filenames = readdir(DIR);
# you could call that function you wanted to call with this list, file would need to be
# opened still, though
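Because readdir returns bare entry names (including '.' and '..') rather than full paths, the names need the directory prepended before they can be opened or passed on to a module. Here is a minimal sketch of that step using only core modules. The helper name list_files is made up for illustration and does not come from the question.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the full paths of the plain files in $dir.
sub list_files {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";
    # readdir in list context returns every entry name, including '.' and '..'
    my @names = grep { !/^\.\.?$/ } readdir($dh);
    closedir $dh;
    # Prepend the directory, then keep only regular files (no subdirectories)
    my @paths = grep { -f $_ } map { File::Spec->catfile($dir, $_) } @names;
    return sort { $a cmp $b } @paths;
}
```

With something like this, my @filenames = list_files($pos_file); would give a list that can be fed directly into a loop or map.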
Another point:
If you want to pass a reference to an array, do it like this:
function( list => \@stems );
# thus, your ngram line should probably rather be
my $ngram = Lingua::EN::Ngram->new (file => \@stems );
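However, the documentation for Lingua::EN::Ngram only talks about scalars (a file name and so on); it does not seem to expect an array as input. (The exception is the 'intersection' method.)
So you have to put the call in a loop, or use map:
my @ngrams = map { Lingua::EN::Ngram->new( file => $_ ) } @filenames;
It does not seem necessary to open the file with a filehandle first; Ngram does that itself.
If you prefer a loop: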
my @ngrams;
for my $filename ( @filenames ){
push @ngrams, Lingua::EN::Ngram->new( file => $filename );
}
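I think I now understand what you are really trying to do.
Getting the tscore: you wrote $tscore = $ngram->tscore, but at that point $ngram is no longer defined.
I am not sure how you would get a tscore for a single word. A t-score ("the significance of a word in a text") presupposes a text.
So: instead of making one ngram object per word, make one per sentence or per file. Then you can determine the t-score of a word within that sentence or file (the text).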
for my $filename ( @files ){
    my $ngram = Lingua::EN::Ngram->new( file => $filename );
    my $tscore = $ngram->tscore();
    # tscore returns a hash reference. Keys are bigrams, values are t-scores.
    # Now you can do with the t-scores what you like. Note that tscore will not
    # work for n-grams of arbitrary length; that you would have to do yourself.
}
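For the "do it yourself" case, here is a rough sketch of how t-scores for n-grams of any length could be computed in plain Perl. It uses the common collocation approximation t = (O - E) / sqrt(O), where O is the observed n-gram count and E is the count expected if the words occurred independently. This is an assumption on my part: it is not necessarily the formula Lingua::EN::Ngram uses internally, and the helper names (tokenize, ngram_counts, tscores) and the simple tokenizer are made up for illustration.

```perl
use strict;
use warnings;

# Split text into lowercase word tokens (a deliberately simple tokenizer).
sub tokenize {
    my ($text) = @_;
    return map { lc $_ } $text =~ /([A-Za-z']+)/g;
}

# Count n-grams of length $n; returns a hash ref of "w1 w2 ..." => count.
sub ngram_counts {
    my ($tokens, $n) = @_;
    my %count;
    for my $i (0 .. $#{$tokens} - $n + 1) {
        $count{ join ' ', @{$tokens}[$i .. $i + $n - 1] }++;
    }
    return \%count;
}

# t-score per n-gram: (O - E) / sqrt(O), with E computed under the
# assumption that the words occur independently of each other.
sub tscores {
    my ($tokens, $n) = @_;
    my $total   = @$tokens;
    my $unigram = ngram_counts($tokens, 1);
    my $ngram   = ngram_counts($tokens, $n);
    my %t;
    for my $gram (keys %$ngram) {
        my $observed = $ngram->{$gram};
        my $expected = $total;
        $expected *= $unigram->{$_} / $total for split / /, $gram;
        $t{$gram} = ($observed - $expected) / sqrt($observed);
    }
    return \%t;
}

# Small demonstration on an inline string; a real run would read each file.
my @tokens = tokenize("the cat sat on the mat and the cat ran");
my $scores = tscores(\@tokens, 3);
for my $gram (sort { $scores->{$b} <=> $scores->{$a} } keys %$scores) {
    printf "%.3f\t%s\n", $scores->{$gram}, $gram;
}
```

The returned hash reference has the same shape as the one described for tscore above (n-gram string as key, score as value), so the sorting and printing code from the question should work on it unchanged.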