Question

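I'm trying to add sentiment classification to a web crawler I'm writing (don't ask why I'm reinventing the wheel, at least as far as the crawler is concerned!). I'm currently looking at naive Bayes, mainly because there are Perl modules that make it easier. However, I'm having trouble setting up a test case that tells good movie reviews from bad ones.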

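As I understand it, I first need to come up with a set of training data for the module to learn from. I went to a few movie review sites and downloaded a dozen or so bad reviews and a dozen or so good ones. I read each file to generate a word list, then convert that list into a hash of word frequencies. I then do the same for a bunch of "unknown" reviews (although I more or less know their sentiment), but I'm running into problems, and I'm not sure whether I'm going about this the wrong way!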

Here is the test output:

*** Processing good reviews
Reading set/good/anniehall.txt
Reading set/good/biglebowski.txt
Reading set/good/contact.txt
Reading set/good/eternalsunshine.txt
Reading set/good/harakiri.txt
Reading set/good/killing.txt
Reading set/good/lincoln.txt
Reading set/good/mulhollanddr.txt
Reading set/good/narayama.txt
Reading set/good/scarface.txt
Reading set/good/seven.txt
Reading set/good/shoah.txt
Reading set/good/spiritedaway.txt
*** Processing bad reviews
Reading set/bad/battlefieldearth.txt
Reading set/bad/charliesangels.txt
Reading set/bad/deathsmootchy.txt
Reading set/bad/deucebigaloweuro.txt
Reading set/bad/freddyfingered.txt
Reading set/bad/humancentipede.txt
Reading set/bad/jasonx.txt
Reading set/bad/north.txt
Reading set/bad/pootietang.txt
Reading set/bad/residentevilapocalypse.txt
Reading set/bad/savingsilverman.txt
Reading set/bad/slackers.txt
Reading set/bad/texaschainsaw.txt
*** Predicting unknown reviews
set/unknown/benjaminbutton.txt: $VAR1 = {
          'bad' => '1.06973342245912e-68',
          'good' => '1'
        };
set/unknown/epic.txt: $VAR1 = {
          'good' => '1',
          'bad' => '7.2271232924459e-35'
        };
set/unknown/hangoverpart3.txt: $VAR1 = {
          'good' => '1',
          'bad' => '1.08569835047604e-17'
        };
set/unknown/jacobsladder.txt: $VAR1 = {
          'good' => '1',
          'bad' => '9.31582505503138e-60'
        };
set/unknown/marleyme.txt: $VAR1 = {
          'good' => '1',
          'bad' => '5.57603799052706e-26'
        };
set/unknown/quantumofsolace.txt: $VAR1 = {
          'bad' => '2.40424666202666e-27',
          'good' => '1'
        };
set/unknown/thespirit.txt: $VAR1 = {
          'bad' => '2.47177895177767e-19',
          'good' => '1'
        };
set/unknown/twilight.txt: $VAR1 = {
          'good' => '1',
          'bad' => '9.77187340648713e-62'
        };

It seems to always label the unknown data as "good"!

Here is the program itself:

use 5.010;
use strict;
use warnings;
use utf8;
use Data::Dumper;

BEGIN { push @INC, "../lib"; }
use Algorithm::NaiveBayes;

my $nb = Algorithm::NaiveBayes->new;

# For each file in each directory, retrieve a hash with each key being a
# unique word, and each value the associated frequency of that word.

# Start with scanning good reviews directory
say "*** Processing good reviews";
my @files = <set/good/*>;
foreach (@files) {
    next if m{(?:^|/)\.[^/]*\z}; # ignore files whose name begins with . (glob returns the full path)
    say "Reading $_";
    my %attr = hash_file($_);
    $nb->add_instance ( attributes => \%attr, label => 'good');
}

# Then scan bad reviews
say "*** Processing bad reviews";
@files = <set/bad/*>;
foreach (@files) {
    next if m{(?:^|/)\.[^/]*\z}; # ignore files whose name begins with . (glob returns the full path)
    say "Reading $_";
    my %attr = hash_file($_);
    $nb->add_instance ( attributes => \%attr, label => 'bad');
}

# Train, and cross fingers
$nb->train;

# Test unknown reviews
say "*** Predicting unknown reviews";
@files = <set/unknown/*>;
foreach (@files) {
    next if m{(?:^|/)\.[^/]*\z}; # ignore files whose name begins with . (glob returns the full path)
    print "$_: ";
    my %attr = hash_file($_);
    my $result = $nb->predict(attributes => \%attr);
    print Dumper($result);
}


# Subroutine that takes a file path and returns a hash of the word frequencies
sub hash_file {
    my ($file) = @_;
    my %words;

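    my @word_list;
    # Three-argument open with a lexical filehandle; report which file failed
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while (<$fh>) {
        chomp;
        push @word_list, split;   # split on whitespace
    }
    close $fh;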

    foreach (@word_list){
        $_ =~ s/[[:punct:]]//g; # Remove punctuation
        next if ($_ eq '');

        # Increment the count; a missing key autovivifies from undef to 1
        $words{$_}++;

    }
    return %words;
}

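I was hoping there was some obvious mistake in the code, but I've checked the hash subroutine and it seems to be producing the correct hashes. The only other thing I can think of is that I may not be using enough data to train it? Or maybe my whole approach is misguided?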
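For reference, a check of the word-counting step can be sketched as below. This is only a minimal sketch, not the exact check I ran: `word_counts` is a hypothetical stand-in for `hash_file` that takes a string instead of a file path, and the sample sentence is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical string-based stand-in for hash_file, so the check needs no files.
sub word_counts {
    my ($text) = @_;
    my %words;
    foreach my $word (split ' ', $text) {
        $word =~ s/[[:punct:]]//g;   # remove punctuation, as hash_file does
        next if $word eq '';         # skip tokens that were punctuation only
        $words{$word}++;
    }
    return %words;
}

# Made-up sample text; print each word with its count, sorted by word.
my %got = word_counts("A great film. Great cast, great script -- a joy!");
printf "%-6s %d\n", $_, $got{$_} for sort keys %got;
```

With this tokenization, `Great` and `great` end up as separate keys, since nothing folds case before counting.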

Thanks for any insight.

Sentiment classification in Perl using naive Bayes

0 answers: