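I have a hash whose keys are scalar strings (document names). Each value is another hash, in which the words of that document are the keys and their frequencies are the values.
Structure: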
{
doc1 => { w1 => freq1 , w2 => freq2, .....} ,
doc2 => { w1 => freq1 , w2 => freq2, .....} ,
.
.
.
}
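I want to compare two keys (doc1, doc2, ...) and find the words the two documents have in common. The desired output is the sum of the frequencies of the common words between two documents, for every pair of documents.
What is the best way to do this?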
Answer 0 (score: 0)
Something like this:
#!/usr/bin/perl
use strict;
use warnings;
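# Summed frequencies, one single-key hash ref per word
my @frequencies;
# First doc
my $doc1 = {
    w1 => 1 , w2 => 5, w3 => 1
};
# Second doc
my $doc2 = {
    w1 => 3 , w2 => 2, w3 => 1, w4 => 12
};
# Walk the first doc: add both counts when the word is also in the second doc,
# otherwise keep the first doc's count on its own
foreach my $word (keys %{$doc1}) {
    if (exists $doc2->{$word}) {
        push (@frequencies, {$word => $doc1->{$word} + $doc2->{$word}});
    }
    else {
        push (@frequencies, {$word => $doc1->{$word}});
    }
    # Remove the word from the second doc so it is not counted again below
    # (note that this modifies $doc2)
    delete $doc2->{$word};
}
# Walk what is left of the second doc: words that appear only there
foreach my $word (keys %{$doc2}) {
    push (@frequencies, {$word => $doc2->{$word}});
}
# Print each word with its summed frequency (hash order is not guaranteed)
print join("\n", map {sprintf("%s: %s", keys %$_, values %$_)} @frequencies), "\n";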
Output:
$ perl compare.pl
w3: 2
w1: 4
w2: 7
w4: 12
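
Note that the script above sums over every word in either document (w4 is printed even though it occurs only in the second doc) and handles a single pair. If only words present in both documents should count, and a total is wanted for every pair of documents, a minimal sketch could look like the one below. The `%docs` sample data, the `common_freq_sum` helper and the `%pair_sum` result hash are hypothetical names made up for illustration, and the sketch assumes that "sum of the frequencies" means adding the counts from both documents for each shared word.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data in the shape described in the question.
my %docs = (
    doc1 => { w1 => 1, w2 => 5, w3 => 1 },
    doc2 => { w1 => 3, w2 => 2, w3 => 1, w4 => 12 },
    doc3 => { w2 => 4, w4 => 1 },
);

# Add up the counts (from both documents) of every word that occurs in both.
# Assumption: this is what "sum of frequencies of common words" means.
sub common_freq_sum {
    my ($first, $second) = @_;
    my $sum = 0;
    for my $word (keys %$first) {
        next unless exists $second->{$word};
        $sum += $first->{$word} + $second->{$word};
    }
    return $sum;
}

# Visit each unordered pair of documents exactly once.
my @names = sort keys %docs;
my %pair_sum;
for my $i (0 .. $#names - 1) {
    for my $j ($i + 1 .. $#names) {
        my ($x, $y) = @names[$i, $j];
        $pair_sum{"$x,$y"} = common_freq_sum($docs{$x}, $docs{$y});
    }
}

# Print one line per pair, in a stable order.
printf "%s: %d\n", $_, $pair_sum{$_} for sort keys %pair_sum;
```

With this sample data the script prints `doc1,doc2: 13`, `doc1,doc3: 9` and `doc2,doc3: 19`. Unlike the script above, it does not delete anything from the input hashes, so the same document can safely take part in several pairs.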