Cosine similarity between strings in Perl

Date: 2016-09-13 11:03:29

Tags: perl

I have a file containing, for example, this text:

 perl java python php scala 
 java pascal perl ruby ada   
 ASP awk php java perl 
 C# ada python java scala

I found a module that computes cosine similarity: http://search.cpan.org/~wollmers/Bag-Similarity-0.019/lib/Bag/Similarity/Cosine.pm

At the beginning I ran a simple test:

my $cosine = Bag::Similarity::Cosine->new;
 my $similarity = $cosine->similarity(['perl','java','python','php','scala'],['java','pascal','perl','ruby','ada']);
print $similarity;

The result is 0.4.
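
For reference, the 0.4 can be reproduced by hand: the two lists share two words (perl and java) and each holds five distinct words, so the cosine is 2 / sqrt(5 * 5) = 0.4. The sketch below is a minimal check of that arithmetic. The `cosine_words` helper is a hypothetical name written for this illustration and is not part of Bag::Similarity; it only assumes the usual bag-of-words definition of cosine similarity.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of Bag::Similarity: cosine similarity
# between two word lists, treating each list as a bag (word => count).
sub cosine_words {
    my ($words_a, $words_b) = @_;
    my (%count_a, %count_b);
    $count_a{$_}++ for @$words_a;
    $count_b{$_}++ for @$words_b;

    # Dot product over the words present in both bags.
    my $dot = 0;
    for my $word (keys %count_a) {
        $dot += $count_a{$word} * $count_b{$word} if exists $count_b{$word};
    }

    # Squared lengths of the two count vectors.
    my ($sq_a, $sq_b) = (0, 0);
    $sq_a += $_ ** 2 for values %count_a;
    $sq_b += $_ ** 2 for values %count_b;

    return 0 unless $sq_a && $sq_b;
    return $dot / sqrt($sq_a * $sq_b);
}

print cosine_words(
    [qw(perl java python php scala)],
    [qw(java pascal perl ruby ada)],
), "\n";    # prints 0.4
```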

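The problem appears when I read from the file and compute the cosine between each pair of lines: the results are different. This is the code: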

open(F,"/home/ahmed/FILE.txt") or die " Pb pour ouvrir";
my @data; # holds one line of the file per element

while(<F>) { 
    chomp; 
    push @data, $_;
}
#print join " ", @data;

 my $cosine = Bag::Similarity::Cosine->new;

for my $i ( 0 .. $#data-1 ) {

    for my $j ( $i + 1 .. $#data ) {

my $similarity = $cosine->similarity($data[$i],$data[$j]);

print "line $i a une similarite de  $similarity avec line $j\n";

 $i + 1,

            $j + 1;
}
}

Result:

line 0 has a similarity of 0.933424735647156 with line 1
line 0 has a similarity of 0.953945734121021 with line 2
line 0 has a similarity of 0.939759036144578 with line 3
line 1 has a similarity of  0.917585834612093 with line 2
line 1 has a similarity of  0.945092544842746 with line 3
line 2 has a similarity of  0.908826679128811 with line 3

The similarity between line 1 and line 2 should be 0.4.

I changed the file like this:

['perl','java','python','php','scala'] 
['java','pascal','perl','ruby','ada']  
['ASP','awk','php','java','perl']
['C#','ada','python','java','scala']

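But the result is the same. Thank you.

2 answers:

Answer 0: (score: 1)

There is a syntax error in your program. Did you mean to use printf and use print by mistake? I'm not sure, but the following works fine for me.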

#!/usr/bin/perl
use strict;
use warnings;
use Bag::Similarity::Cosine;

my $cosine = Bag::Similarity::Cosine->new;
my @data;

while ( <DATA> ) {
    push @data, { map { $_ => 1 } split };
}

for my $i ( 0 .. $#data-1 ) {
    for my $j ( $i + 1 .. $#data ) {
        my $similarity = $cosine->similarity($data[$i],$data[$j]);
        print "line $i has a similarity of $similarity with line $j\n";
    }
}

__DATA__
perl java python php scala
java pascal perl ruby ada
ASP awk php java perl
C# ada python java scala

Output:

line 0 has a similarity of 0.4 with line 1
line 0 has a similarity of 0.6 with line 2
line 0 has a similarity of 0.6 with line 3
line 1 has a similarity of 0.4 with line 2
line 1 has a similarity of 0.4 with line 3
line 2 has a similarity of 0.2 with line 3
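
The stray `$i + 1, $j + 1;` lines after the print statement in the question look like leftover printf arguments. If 1-based line numbers were the intent, a printf-style version might look like the sketch below. `format_result` is a hypothetical helper name used only for this illustration, and the value 0.4 is taken from the output above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: format one result line with 1-based line numbers.
sub format_result {
    my ($i, $j, $similarity) = @_;
    return sprintf "line %d has a similarity of %s with line %d\n",
        $i + 1, $similarity, $j + 1;
}

print format_result(0, 1, 0.4);    # prints: line 1 has a similarity of 0.4 with line 2
```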

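Answer 1: (score: 0)

I know nothing about this module, but I can read the documentation.

It looks to me as though the module has two methods: similarity(), which is for comparing two strings, and from_bags(), which is for comparing two references to arrays containing strings. I expect that when you call similarity() and pass it two array references, what actually gets compared is the stringification of the two references.

Try switching to from_bags() and see if that works better.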

Update: On further investigation, I see that similarity() will compare any kind of input (strings, array references, or hash references).
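
Which of those three kinds a value belongs to can be checked with Perl's built-in ref(). The sketch below is only an illustration (`input_kind` is a hypothetical helper, not something the module provides). It shows that a line read from a file is a plain string even when its text looks like a Perl array literal, which would explain why rewriting the file with brackets and quotes changed nothing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: describe what kind of value was passed in.
sub input_kind {
    my ($value) = @_;
    my $type = ref $value;
    return $type eq 'ARRAY' ? 'array ref'
         : $type eq 'HASH'  ? 'hash ref'
         : $type eq ''      ? 'string'
         :                    'other ref';
}

# Text as it would arrive from the modified file: still just a string.
my $line_from_file = "['perl','java','python','php','scala']";

print input_kind($line_from_file), "\n";                  # string
print input_kind([split ' ', 'perl java python']), "\n";  # array ref
print input_kind({ perl => 1, java => 1 }), "\n";         # hash ref
```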

This demonstrates how to use similarity() to compare the lines both as text and as arrays of words.

#!/usr/bin/perl

use strict;
use warnings;
use 5.010;

use Bag::Similarity::Cosine;

chomp(my @data = <DATA>);

my $cos = Bag::Similarity::Cosine->new;

for my $i (0 .. $#data - 1) {
  for my $j (1 .. $#data) {
    next if $i == $j;
    say "$i -> $j: strings ", $cos->similarity($data[$i], $data[$j]);
    say "$i -> $j: array refs ", $cos->similarity([split /\s+/, $data[$i]], [split /\s+/, $data[$j]]);
  }
}

__DATA__
perl java python php scala
java pascal perl ruby ada
ASP awk php java perl
C# ada python java scala

It gives this output:

$ perl similar
0 -> 1: strings 0.88602000346543
0 -> 1: array refs 0.4
0 -> 2: strings 0.89566858950296
0 -> 2: array refs 0.6
0 -> 3: strings 0.852802865422442
0 -> 3: array refs 0.6
1 -> 2: strings 0.872356744289958
1 -> 2: array refs 0.4
1 -> 3: strings 0.884721984738799
1 -> 3: array refs 0.4
2 -> 1: strings 0.872356744289958
2 -> 1: array refs 0.4
2 -> 3: strings 0.753778361444409
2 -> 3: array refs 0.2

I don't know which version gives you the information you want. I suspect it is probably the array reference version.
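
Assuming the word-level comparison is the one wanted, the whole task can be sketched as: read each line, split it into words, and compare every unordered pair of lines once. The version below is a sketch, not the module's implementation. It uses a hand-written `cosine_words` helper so it runs without Bag::Similarity installed, and an in-memory filehandle stands in for FILE.txt. All helper names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, independent of Bag::Similarity: cosine similarity
# of two word lists treated as bags of words.
sub cosine_words {
    my ($words_a, $words_b) = @_;
    my (%count_a, %count_b);
    $count_a{$_}++ for @$words_a;
    $count_b{$_}++ for @$words_b;

    my ($dot, $sq_a, $sq_b) = (0, 0, 0);
    for my $word (keys %count_a) {
        $dot += $count_a{$word} * $count_b{$word} if exists $count_b{$word};
    }
    $sq_a += $_ ** 2 for values %count_a;
    $sq_b += $_ ** 2 for values %count_b;

    return $sq_a && $sq_b ? $dot / sqrt($sq_a * $sq_b) : 0;
}

# Read every line of a filehandle into an array ref of its words.
sub read_word_lists {
    my ($fh) = @_;
    my @lists;
    while (my $line = <$fh>) {
        my @words = split ' ', $line;    # also drops leading/trailing whitespace
        push @lists, \@words if @words;  # skip blank lines
    }
    return @lists;
}

# Compare each unordered pair of lines exactly once.
sub pairwise_report {
    my @lists = @_;
    my @report;
    for my $i (0 .. $#lists - 1) {
        for my $j ($i + 1 .. $#lists) {
            push @report, sprintf "line %d has a similarity of %s with line %d",
                $i, cosine_words($lists[$i], $lists[$j]), $j;
        }
    }
    return @report;
}

# In-memory stand-in for FILE.txt, including the stray spaces.
my $text = " perl java python php scala \n java pascal perl ruby ada   \n"
         . " ASP awk php java perl \n C# ada python java scala\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
print "$_\n" for pairwise_report(read_word_lists($fh));
close $fh;
```

Starting the inner loop at `$i + 1` avoids reporting the same pair twice, which is why the `2 -> 1` lines in the output above have no counterpart here.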