chdir("c:/perl/normalized");
$docid=0;
use List::MoreUtils qw( uniq );
my %hash = ();
@files = <*>;
foreach $file (@files)
{
    $docid++;
    open (input, $file);
    while (<input>)
    {
        open (output,'>>c:/perl/postinglist/total');
        chomp;
        (@words) = split(" ");
        foreach $word (@words)
        {
            push @{ $hash{$word} }, $docid;
        }
    }
}
foreach $key (sort keys %hash)
{
    $size = scalar (@{$hash{$key}});
    print output "Term: $key, Frequency:$size, Document(s):", join(" ", uniq @{ $hash{$key} }), "\n";
}
close (input);
close (output);
The output before adding join(" ", uniq @{ $hash{$key} }) was as follows:
Term:of Frequency:35 Document(s): 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 4 4 4 4 5 6 6 7 7 7 7 7 7 7 7 7
where Document(s) shows how the frequency is distributed across the documents. With uniq, the output becomes:
Term:of Frequency:35 Document(s):1 2 3 4 5 6 7
This is fine up to here... but how can I keep a counter while removing the duplicates, so that my new output would be
Term:of Frequency:35 Document(s) of: 1(10) 2(7) 3(2) 4(4) 5(1) 6(2) 7(9)
i.e. value(counter)
I was able to fix my own problem by making some changes to the source code:
chdir("c:/perl/normalized");
$docid=0;
my %hash = ();
@files = <*>;
foreach $file (@files)
{
    $counter=0;
    $docid++;
    open (input, $file);
    while (<input>)
    {
        open (output,'>>c:/perl/tokens/total');
        chomp;
        (@words) = split(" ");
        foreach $word (@words)
        {
            # One element is pushed per occurrence, so the array length is the
            # word's count in this document; incrementing element 0 does not change it.
            push @{ $hash{$word}{$docid}},$counter;
            @{$hash{$word}{$docid}}[$counter]++;
        }
    }
}
foreach my $line (sort keys %hash) {
    print output "Term:$line \n";
    foreach my $elem (sort keys %{$hash{$line}}) {
        # The array is concatenated in scalar context, which yields its length.
        print output" Doc:$elem " . "freq:".@{$hash{$line}->{$elem}} . "\n";
    }
}
close (input);
close (output);
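For anyone puzzled by why the freq: value comes out right, here is a minimal sketch of the behaviour the print statement depends on: an array used with the concatenation operator is evaluated in scalar context and yields its element count. The array contents and variable names below are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical data: one element per occurrence of a word in one document.
my @occurrences = (3, 0, 0);

# The "." operator puts @occurrences in scalar context, giving its length.
my $freq_label = "freq:" . @occurrences;

print "$freq_label\n";    # prints "freq:3"
```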
Answer 0 (score: 1)
The best option is probably to use a hash instead of an array and keep the count as the hash value. Change
push @{ $hash{$word} }, $docid;
to
++$hash{$word}{$docid};
Use keys to get the document IDs. You will lose the order, but you can easily restore it with a numeric sort.
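Putting that together, here is a rough sketch of how the hash-of-hashes approach could produce the value(counter) output asked for. It is only an illustration: the in-memory @docs list stands in for the files in c:/perl/normalized, and the posting_line helper is a name I made up, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical in-memory documents standing in for the real input files.
my @docs = (
    "the cost of the price of tea",
    "a cup of tea",
);

my %hash;
my $docid = 0;
for my $text (@docs) {
    $docid++;
    # Count each occurrence of a word per document.
    ++$hash{$_}{$docid} for split ' ', $text;
}

# Build one report line for a term (helper name is an assumption).
sub posting_line {
    my ($term) = @_;
    my $counts = $hash{$term};
    my $total  = 0;
    $total += $_ for values %$counts;
    # keys returns the document IDs unordered, so sort them numerically.
    my @postings = map { "$_($counts->{$_})" } sort { $a <=> $b } keys %$counts;
    return "Term: $term, Frequency: $total, Document(s): @postings";
}

print posting_line($_), "\n" for sort keys %hash;
```

With the sample data, the line for "of" reads Term: of, Frequency: 3, Document(s): 1(2) 2(1). The numeric sort matters once there are ten or more documents, since a plain sort would put 10 before 2.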