Unable to get weighted cosine similarity

Asked: 2016-07-08 13:59:33

Tags: perl module

I am trying to get the weighted cosine similarity of two files. I am using Text::Document and Text::DocumentCollection. My code seems to work, but it doesn't return a number as I expected.

Here is my code:

use strict;
use warnings;

use Text::Document;
use Text::DocumentCollection;

my $newfile  = shift @ARGV;
my $newfile2 = shift @ARGV;

##This is in another file.
my $t1 = countFreq($newfile);
my $t2 = countFreq($newfile2);

my $collection = Text::DocumentCollection->new(file => 'coll.db');
$collection->Add("One", $t1);
$collection->Add("Two", $t2);

my $wSim = $t1->WeightedCosineSimilarity( $t2,
    \&Text::DocumentCollection::IDF,
    $collection
);

print "\nWeighted Cosine Sim is: $wSim\n";

All this prints is "Weighted Cosine Sim is:", with nothing after the colon.

Here is the code for countFreq:

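sub countFreq {
    my ($file) = @_;

    my $t1 = Text::Document->new();

    # Three-argument open, and report the file name and OS error on failure
    open( my $info, '<', $file ) or die "Could not open '$file': $!";
    while ( my $line = <$info> ) {
        chomp $line;
        $line =~ s/[[:punct:]]//g;
        foreach my $str ( split /\s+/, $line ) {
            # %sp is not declared in the code shown; presumably a
            # stop-word hash defined elsewhere in this file
            if ( !defined $sp{ lc($str) } ) {
                $t1->AddContent($str);
            }
        }
    }
    close $info;

    return $t1;
}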

1 Answer:

Answer 0 (score: 0)

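Update

Here is an example program that works fine. It is based on toolic's suggestion to look at the test code in the distribution for inspiration.

I was expecting the test to be much less sensitive than it is, which is why I was getting zero from two very different text sources. This example adds three short sentences, $d1, $d2 and $d3, to a collection $c, and then compares each of the three documents with $d1.

Comparing $d1 with itself yields 1, an exact match as expected, while comparing it with $d2 and $d3 gives 0.087 and 0 respectively: a partial match and no match at all.

I hope this helps you to resolve your specific problem.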

use strict;
use warnings 'all';

use Text::Document;
use Text::DocumentCollection;

my $d1 = Text::Document->new;
$d1->AddContent( 'my heart belongs to sally webster' );

my $d2 = Text::Document->new;
$d2->AddContent( 'my heart belongs to the girl next door' );

my $d3 = Text::Document->new;
$d3->AddContent( 'I want nothing to do with my neighbours' );

my $c = Text::DocumentCollection->new( file => 'coll2.db' );

$c->Add('one',   $d1);
$c->Add('two',   $d2);
$c->Add('three', $d3);

for my $doc ( $d1, $d2, $d3 ) {

    my $wcs = $d1->WeightedCosineSimilarity(
        $doc,
        \&Text::DocumentCollection::IDF,
        $c
    );

    die qq{Invalid parameters for "WeightedCosineSimilarity"} unless defined $wcs;

    print $wcs, "\n";
}

Output

1
0.0874311036726221
0
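
To see where those three numbers could come from, here is a minimal stand-alone sketch that does not use the module at all. It assumes each term is weighted by its count multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. That weighting is my assumption, not something taken from the module's source; the helper names term_counts, idf and weighted_cosine are made up for the illustration.

```perl
use strict;
use warnings;

# The three sample sentences from the example program
my @docs = (
    'my heart belongs to sally webster',
    'my heart belongs to the girl next door',
    'I want nothing to do with my neighbours',
);

# Count how often each lower-cased word occurs in a text
sub term_counts {
    my ($text) = @_;
    my %count;
    $count{ lc $_ }++ for split ' ', $text;
    return \%count;
}

my @tf = map { term_counts($_) } @docs;

# Document frequency: the number of documents that contain each term
my %df;
for my $tf (@tf) {
    $df{$_}++ for keys %$tf;
}

# Assumed weight: log(N / df), which is 0 for a term found in every document
sub idf {
    my ($term) = @_;
    return log( @docs / $df{$term} );
}

# Cosine of the weighted term vectors; undef if either vector has zero length
sub weighted_cosine {
    my ( $x, $y ) = @_;
    my ( $dot, $nx, $ny ) = ( 0, 0, 0 );
    my %union = map { $_ => 1 } keys %$x, keys %$y;
    for my $term ( keys %union ) {
        my $wx = idf($term) * ( $x->{$term} // 0 );
        my $wy = idf($term) * ( $y->{$term} // 0 );
        $dot += $wx * $wy;
        $nx  += $wx**2;
        $ny  += $wy**2;
    }
    return undef if $nx == 0 or $ny == 0;
    return $dot / sqrt($nx) / sqrt($ny);
}

printf "%.4f\n", weighted_cosine( $tf[0], $_ ) for @tf;
```

Under that assumption the sketch prints 1.0000, 0.0874 and 0.0000. The first and third sentences share only "my" and "to", which occur in all three documents and so carry zero weight, which would explain the final 0.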


This is the code for Text::Document::WeightedCosineSimilarity
# this is rather rough
sub WeightedCosineSimilarity
{
    my $self = shift;
    my ($e,$weightFunction,$rock) = @_;

    my ($Dv,$Ev) = ($self->{terms}, $e->{terms});

# compute union
    my %union =  %{$self->{terms}};
    my @keyse = keys %{$e->{terms}};
    @union{@keyse} = @keyse;
    my @allkeys = keys %union;

# weighted D
    my @Dw = map(( defined( $Dv->{$_} )?
        &{$weightFunction}( $rock, $_ )*$Dv->{$_} : 0.0 ),
        @allkeys
    );

# weighted E
    my @Ew = map(( defined( $Ev->{$_} )?
        &{$weightFunction}( $rock, $_ )*$Ev->{$_} : 0.0 ),
        @allkeys
    );

# dot product of D and E
    my $dotProduct = 0.0;
    map( $dotProduct += $Dw[$_] * $Ew[$_] , 0..$#Dw );

# norm of D
    my $nD = 0.0;
    map( $nD += $Dw[$_] * $Dw[$_] , 0..$#Dw );
    $nD = sqrt( $nD );

# norm of E
    my $nE = 0.0;
    map( $nE += $Ew[$_] * $Ew[$_] , 0..$#Ew );
    $nE = sqrt( $nE );

# dot product scaled by norm
    if( ($nD==0) || ($nE==0) ){
        return undef;
    } else {
        return $dotProduct / $nD / $nE;
    }
}

I'm afraid I don't understand the theory behind it, but it looks like your problem is that either $nD (the "norm of D") or $nE (the "norm of E") is zero.
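
Reading the routine quoted above term by term, it appears to compute the following, where d_t and e_t are the values stored for term t in each document's terms hash (presumably term counts), w_t is whatever the weight function returns for t, and the sums run over the union of both documents' terms:

```latex
\[
\mathrm{sim}(D, E) =
  \frac{\sum_{t} (w_t\, d_t)(w_t\, e_t)}
       {\sqrt{\sum_{t} (w_t\, d_t)^2}\;\sqrt{\sum_{t} (w_t\, e_t)^2}}
\]
```

The routine returns undef when either square root is zero, which happens if a document has no terms or if every one of its terms has weight zero. If the IDF weight has the usual form log(N / n_t), which is an assumption on my part but is consistent with the 0 in the example output, then any term that occurs in every document of the collection gets weight zero. With only two documents in the collection, a document whose words all appear in the other one would then have a zero norm.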

All I can suggest is that your two text samples are perhaps too similar or too different, or that they are too long or too short.

Either way, your code should look like this, so as to catch an invalid return value from the cosine function:

my $wSim = $t1->WeightedCosineSimilarity( $t2,
    \&Text::DocumentCollection::IDF,
    $collection
);

die qq{Invalid parameters for "WeightedCosineSimilarity"} unless defined $wSim;

print "\nWeighted Cosine Sim is: $wSim\n";
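
If you would rather report the problem and carry on than die, something along these lines might do. It is only a sketch: describe_similarity is a hypothetical helper of my own, not part of Text::Document, and the message wording is just a guess at what would be useful.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a possibly undefined similarity into printable text
sub describe_similarity {
    my ($sim) = @_;
    return defined $sim
        ? sprintf( '%.4f', $sim )
        : 'undefined (one of the weighted vectors has zero length)';
}

# Stand-in values; in the real program pass the result of WeightedCosineSimilarity
print "Weighted Cosine Sim is: ", describe_similarity(undef), "\n";
print "Weighted Cosine Sim is: ", describe_similarity(0.0874311036726221), "\n";
```

This also avoids the "uninitialized value" warning that interpolating an undefined $wSim into the string would trigger under "use warnings".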