How to use Lingua::EN::Ngram with multiple files

Date: 2014-11-15 10:16:55

Tags: perl n-gram

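I am implementing a naive Bayes classification algorithm. My training set consists of many abstracts, each in a separate file. I want to use n-grams to get term-frequency weights, but my code does not handle multiple files.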

I edited my code, and now I get the error: Can't call method "tscore" on an undefined value. To check this, I printed @ngrams, and it shows garbage-looking values like hash0*29G45 or something similar.

  #!c:\perl\bin\perl.exe -w

  use warnings;

  use Algorithm::NaiveBayes;
  use Lingua::EN::Splitter qw(words);
  use Lingua::StopWords qw(getStopWords);
  use Lingua::Stem;
  use Algorithm::NaiveBayes;
  use Lingua::EN::Ngram;
  use Data::Dumper;
  use Text::Ngram;
  use PPI::Tokenizer;
  use Text::English;
  use Text::TFIDF;
  use File::Slurp;

  my $pos_file  = 'D:\aminoacids';
  my $neg_file  = 'D:\others';
  my $test_file = 'D:\testfiles';
  my @vectors   = ();

  my $categorizer = Algorithm::NaiveBayes->new;

  my @files = <$pos_file/*>;
  my @ngrams;
  for my $filename (@files) {

    open(FH, $filename);

    my $ngram = Lingua::EN::Ngram->new($filename);

    my $tscore = $ngram->tscore;

    foreach (sort { $$tscore{$b} <=> $$tscore{$a} } keys %$tscore) {
      print "$$tscore{ $_ }\t" . "$_\n";
    }

    my $trigrams = $ngram->ngram(2);

    foreach my $trigram (sort { $$trigrams{$b} <=> $$trigrams{$a} } keys %$trigrams) {
      print $$trigrams{$trigram}, "\t$trigram\n";
    }

    my %positive;

    $positive{$_}++ for @files;

    $categorizer->add_instance(
      attributes => \%positive,
      label      => 'positive'
    );
  }

  close FH;

1 answer:

Answer 0 (score: 1)

Your code <$pos_file/*> should work fine (thanks @borodir), but here is an alternative anyway, so as not to mess up the history. Try

opendir (DIR, $directory) or die $!;

and then

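 while (my $filename = readdir(DIR)) {

    # readdir returns bare names, so prepend the directory
    my $path = "$directory/$filename";
    next unless -f $path;    # skips '.', '..' and subdirectories

    open ( my $fh, '<', $path ) or die "Cannot open $path: $!";

    # work with filehandle

    close $fh;

}

closedir DIR;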

If called in list context, readdir should give you the list of files:

my @filenames = readdir(DIR);
# You could pass this list to the function you wanted to call. Note that the entries are
# bare names (including '.' and '..'), and each file would still need to be opened.
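
Because readdir returns bare entry names rather than paths, the list usually needs a little cleanup before it can be used. A minimal sketch using only core modules; the helper name list_files is made up for this illustration:

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the full paths of the plain files in $directory.
sub list_files {
    my ($directory) = @_;

    opendir(my $dh, $directory) or die "Cannot open directory $directory: $!";
    my @names = readdir($dh);    # list context: every entry at once
    closedir $dh;

    # readdir yields bare names such as '.', '..' and 'abstract1.txt',
    # so build a path for each one and keep only regular files.
    my @paths = grep { -f $_ }
                map  { File::Spec->catfile($directory, $_) } @names;

    @paths = sort @paths;
    return @paths;
}
```

Calling list_files('D:\aminoacids') in list context should then give paths that can be opened directly or handed to a constructor that expects a file name.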

Another point:

If you want to pass a reference to an array, do it like this:

function( list => \@stems );
# thus, your ngram line should probably rather be

my $ngram = Lingua::EN::Ngram->new (file => \@stems );

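However, the documentation of Lingua::EN::Ngram only talks about scalars for files and so on; it does not seem to expect an array as input. (The exception is the 'intersection' method.)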

So you have to put it in a loop and iterate, or use map:

my @ngrams = map { Lingua::EN::Ngram->new( file => $_ ) } @filenames;

There seems to be no need to open a filehandle first; Ngram does that itself.

If you prefer a loop:

my @ngrams;
for my $filename ( @filenames ){ 
   push @ngrams, Lingua::EN::Ngram->new( file => $filename );
}

I think I now understand what you actually want to do.

Getting the tscore: you wrote $tscore = $ngram->tscore, but $ngram is no longer defined at that point.
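
The "garbage" seen when printing @ngrams is most likely just how Perl prints object references (something like Lingua::EN::Ngram=HASH(0x...)), which would mean the objects themselves were created fine. The error would then come from calling the method on a variable that holds nothing, rather than on the elements of the array. A small self-contained sketch of both effects; Demo::Counter is a made-up stand-in class, not part of Lingua::EN::Ngram:

```perl
use strict;
use warnings;

# Made-up stand-in class, only here so the example runs without CPAN modules.
package Demo::Counter;
sub new   { my ($class, %args) = @_; return bless { %args }, $class; }
sub label { my ($self) = @_; return $self->{file}; }

package main;

my @objects = map { Demo::Counter->new(file => $_) } qw(a.txt b.txt);

# Printing an object shows its class and address, e.g. Demo::Counter=HASH(0x...).
# That is not corruption, just the default string form of a blessed reference.
my $as_string = "$objects[0]";

# Calling a method on a variable that holds undef reproduces the error message.
my $missing;
my $ok    = eval { $missing->label; 1 };
my $error = $ok ? '' : $@;

# The fix is to call the method on each element of the array instead.
my @labels = map { $_->label } @objects;

print "$as_string\n$error", join(', ', @labels), "\n";
```

To look inside an object rather than at its address, Data::Dumper (already loaded in the question's code) can print its contents.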

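I am not sure how you would get a tscore for a single word. The description ("significance of a word in a text") implies that a text is needed.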
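
To see why a whole text is needed, here is a rough pure-Perl sketch of what a bigram t-score measures. It uses the common textbook approximation t = (O - E) / sqrt(O), where O is the observed count of the bigram and E = count(w1) * count(w2) / N is the count expected if the two words were independent, with N the number of words in the text. This is an assumption made for illustration: Lingua::EN::Ngram may tokenize and compute its score differently, and bigram_tscores is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical helper: textbook t-score for every bigram in one text.
sub bigram_tscores {
    my ($text) = @_;

    # Naive tokenizer: lowercase, keep runs of letters and apostrophes.
    my @words = lc($text) =~ /[a-z']+/g;
    my $n     = @words;

    # Count single words and adjacent word pairs.
    my (%unigram, %bigram);
    $unigram{$_}++ for @words;
    $bigram{"$words[$_] $words[$_ + 1]"}++ for 0 .. $#words - 1;

    my %tscore;
    for my $pair (keys %bigram) {
        my ($w1, $w2) = split / /, $pair;
        my $observed  = $bigram{$pair};
        my $expected  = $unigram{$w1} * $unigram{$w2} / $n;
        $tscore{$pair} = ($observed - $expected) / sqrt($observed);
    }
    return \%tscore;
}

# Print the bigrams of a short sample text, highest score first.
my $scores = bigram_tscores(
    'amino acids are organic compounds and amino acids build proteins');
printf "%.3f\t%s\n", $scores->{$_}, $_
    for sort { $scores->{$b} <=> $scores->{$a} || $a cmp $b } keys %$scores;
```

Every quantity in the formula is a count taken over the whole text, so a single word on its own has nothing to be scored against.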

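Therefore: do not create an ngram for each word, but one for each sentence or each file. You can then determine the t-score of a word within that sentence or file (the text).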

for my $filename ( @files ){
   my $ngram = Lingua::EN::Ngram->new( file => $filename );

   my $tscore = $ngram->tscore(); 
   # tscore returns a hash reference. Keys are bigrams, values are tscores
   # now you can do with the tscore what you like. Note that for arbitrary length,
   # tscore will not work. This you would have to do yourself.