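I am trying to get the weighted cosine similarity of two documents. I am using Text::Document and Text::DocumentCollection. My code seems to work, but it does not return a number as I expected.
Here is my code: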
use strict;
use warnings;
use Text::Document;
use Text::DocumentCollection;
my $newfile = shift @ARGV;
my $newfile2 = shift @ARGV;
##This is in another file.
my $t1 = countFreq($newfile);
my $t2 = countFreq($newfile2);
my $collection = Text::DocumentCollection->new(file => 'coll.db');
$collection->Add("One", $t1);
$collection->Add("Two", $t2);
my $wSim = $t1->WeightedCosineSimilarity( $t2,
\&Text::DocumentCollection::IDF,
$collection
);
print "\nWeighted Cosine Sim is: $wSim\n";
All this prints is "Weighted Cosine Sim is:" with nothing after the colon.
Here is the code for countFreq:
sub countFreq{
my ($file) = @_;
my $t1 = Text::Document->new();
open(my $info, '<', $file) or die "Could not open file '$file': $!";
while (my $line = <$info>) {
chomp $line;
$line =~ s/[[:punct:]]//g;
foreach my $str (split /\s+/, $line) {
# %sp is not declared in this snippet; presumably a stop-word hash defined in the other file
if (!defined $sp{lc($str)}) {
$t1 -> AddContent ($str);
}
}
}
return $t1;
}
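One way to sanity-check countFreq is to look at which tokens survive the filtering before they reach AddContent. This is a minimal sketch, not the original code: it assumes %sp is a stop-word lookup hash (the real one is defined elsewhere and is not shown), and surviving_tokens and the sample stop words are hypothetical names made up for illustration.

```perl
use strict;
use warnings;

# Assumption: %sp is a stop-word lookup table. These entries are made up;
# the real hash lives in another file that is not shown.
my %sp = map { $_ => 1 } qw(the a an of to);

# Hypothetical helper: apply the same clean-up as countFreq to one line
# and return the tokens that would be passed to AddContent.
sub surviving_tokens {
    my ($line) = @_;
    $line =~ s/[[:punct:]]//g;    # strip punctuation, as countFreq does
    return grep { length($_) && !defined $sp{ lc $_ } } split /\s+/, $line;
}

my @kept = surviving_tokens("The heart of a lion, to the end.");
print scalar(@kept), " tokens kept: @kept\n";    # 3 tokens kept: heart lion end
```

If a whole file yields no surviving tokens, the resulting document is empty, which is worth ruling out before comparing it with anything.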
Answer 0 (score: 0)
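Here is a sample program that works fine. It is based on toolic's suggestion to look at the test code in the distribution for inspiration.
I was expecting the test to be much less sensitive, so I got zeroes from two very different text sources. This example adds three short sentences, $d1, $d2 and $d3, to a collection $c, and then compares each of the three documents with $d1.
Comparing $d1 with itself yields 1, an exact match as expected, while comparing it with $d2 and $d3 yields 0.087 and 0 respectively: a partial match and no match at all.
I hope this helps you to resolve your specific problem.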
use strict;
use warnings 'all';
use Text::Document;
use Text::DocumentCollection;
my $d1 = Text::Document->new;
$d1->AddContent( 'my heart belongs to sally webster' );
my $d2 = Text::Document->new;
$d2->AddContent( 'my heart belongs to the girl next door' );
my $d3 = Text::Document->new;
$d3->AddContent( 'I want nothing to do with my neighbours' );
my $c = Text::DocumentCollection->new( file => 'coll2.db' );
$c->Add('one', $d1);
$c->Add('two', $d2);
$c->Add('three', $d3);
for my $doc ( $d1, $d2, $d3 ) {
my $wcs = $d1->WeightedCosineSimilarity(
$doc,
\&Text::DocumentCollection::IDF,
$c
);
die qq{Invalid parameters for "WeightedCosineSimilarity"} unless defined $wcs;
print $wcs, "\n";
}
1
0.0874311036726221
0
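The three values above can be reproduced in plain Perl, which may make the calculation easier to follow. This is only a sketch under an assumption: it takes the weight of a term to be log(N / n), with N the number of documents in the collection and n the number that contain the term. That formula is not confirmed against the source of Text::DocumentCollection::IDF, although it does reproduce the output above. The helpers term_counts, idf and weighted_cosine are hypothetical names, not part of either module.

```perl
use strict;
use warnings;

# Hypothetical helper: lower-case a string, split on whitespace, count terms.
sub term_counts {
    my ($text) = @_;
    my %count;
    $count{ lc $_ }++ for grep { length($_) } split /\s+/, $text;
    return \%count;
}

# Assumed IDF: log(N / n). A term found in every document gets weight 0.
sub idf {
    my ($docs, $term) = @_;
    my $n = grep { exists $_->{$term} } @$docs;
    return $n ? log(scalar(@$docs) / $n) : 0;
}

# Cosine of the angle between the two IDF-weighted term vectors.
sub weighted_cosine {
    my ($d, $e, $docs) = @_;
    my %union = (%$d, %$e);
    my ($dot, $nd, $ne) = (0, 0, 0);
    for my $term (keys %union) {
        my $w  = idf($docs, $term);
        my $dw = $w * ($d->{$term} // 0);
        my $ew = $w * ($e->{$term} // 0);
        $dot += $dw * $ew;
        $nd  += $dw ** 2;
        $ne  += $ew ** 2;
    }
    return undef if $nd == 0 || $ne == 0;    # same guard as the module
    return $dot / sqrt($nd) / sqrt($ne);
}

my @docs = map { term_counts($_) } (
    'my heart belongs to sally webster',
    'my heart belongs to the girl next door',
    'I want nothing to do with my neighbours',
);

# Prints 1.0000, 0.0874 and 0.0000, one per line.
printf "%.4f\n", weighted_cosine($docs[0], $_, \@docs) for @docs;
```

Under that assumption, "my" and "to" appear in all three documents and so carry zero weight, which is why the first and third sentences score 0 even though they share two words.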
This is the code for Text::Document::WeightedCosineSimilarity:
# this is rather rough
sub WeightedCosineSimilarity
{
my $self = shift;
my ($e,$weightFunction,$rock) = @_;
my ($Dv,$Ev) = ($self->{terms}, $e->{terms});
# compute union
my %union = %{$self->{terms}};
my @keyse = keys %{$e->{terms}};
@union{@keyse} = @keyse;
my @allkeys = keys %union;
# weighted D
my @Dw = map(( defined( $Dv->{$_} )?
&{$weightFunction}( $rock, $_ )*$Dv->{$_} : 0.0 ),
@allkeys
);
# weighted E
my @Ew = map(( defined( $Ev->{$_} )?
&{$weightFunction}( $rock, $_ )*$Ev->{$_} : 0.0 ),
@allkeys
);
# dot product of D and E
my $dotProduct = 0.0;
map( $dotProduct += $Dw[$_] * $Ew[$_] , 0..$#Dw );
# norm of D
my $nD = 0.0;
map( $nD += $Dw[$_] * $Dw[$_] , 0..$#Dw );
$nD = sqrt( $nD );
# norm of E
my $nE = 0.0;
map( $nE += $Ew[$_] * $Ew[$_] , 0..$#Ew );
$nE = sqrt( $nE );
# dot product scaled by norm
if( ($nD==0) || ($nE==0) ){
return undef;
} else {
return $dotProduct / $nD / $nE;
}
}
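I'm afraid I don't understand the theory behind it, but it looks like your problem is that either $nD (the norm of "D") or $nE (the norm of "E") is zero, because that is the only case in which the subroutine returns undef.
All I can suggest is that your two text samples may be too similar or too different, or that they are too long or too short.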
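To see when a norm can be zero, here is a small sketch. It assumes the IDF weight of a term is log(N / n), with N documents in the collection and n of them containing the term; that is an assumption and is not taken from the module source. weighted_norm is a hypothetical helper working on plain hashes of term counts, not on Text::Document objects.

```perl
use strict;
use warnings;

# Hypothetical helper: norm of a document's IDF-weighted term vector.
# $doc is a hashref of term => count, $docs an arrayref of such hashrefs.
sub weighted_norm {
    my ($doc, $docs) = @_;
    my $sum = 0;
    for my $term (keys %$doc) {
        my $n = grep { exists $_->{$term} } @$docs;
        next unless $n;    # term not in the collection: skipped in this sketch
        my $w = $doc->{$term} * log(scalar(@$docs) / $n);    # assumed IDF
        $sum += $w * $w;
    }
    return sqrt $sum;
}

my %one   = (apple => 2, pear => 1);
my %two   = (apple => 1, pear => 3);    # same vocabulary as %one
my %three = (apple => 1, plum => 1);    # shares only "apple" with %one

print weighted_norm(\%one, [\%one, \%two]),   "\n";    # 0: every term is in both documents
print weighted_norm(\%one, [\%one, \%three]), "\n";    # non-zero: "pear" is in one document only
print weighted_norm({},    [\%one, \%three]), "\n";    # 0: an empty document has no terms
```

If that assumption holds, a two-document collection whose documents use exactly the same set of words gives every term a weight of zero, and an empty document has no terms at all; either case would make WeightedCosineSimilarity return undef.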
Either way, your code should look like this, so as to catch an invalid return value from the cosine function:
my $wSim = $t1->WeightedCosineSimilarity( $t2,
\&Text::DocumentCollection::IDF,
$collection
);
die qq{Invalid parameters for "WeightedCosineSimilarity"} unless defined $wSim;
print "\nWeighted Cosine Sim is: $wSim\n";