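I have this sorting method. It is basically just a naive thought process that doesn't use the power of Perl, and occasionally it doesn't behave the way I want (it misses some frequency counts). I'd like to know whether there is a better way to sort this.
Goal: sort the array by the frequency of the matches found.
Example array of arrays
##ADDED 1 to END of EACH ROW, just because my sort forced me to!!!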
my @all_matches = (["chpt10_2", "sent. 2", "alice", "nsubj", "animals", "protect"],
["chpt12_1", "sent. 54", "bob", "nsubj", "cells", "protect"],
["chpt25_4", "sent. 47", "carol", "nsubj", "plants", "protect"],
["chpt34_1", "sent. 1", "dave", "nsubj", "cells", "protect"],
["chpt35_1", "sent. 2", "eli", "nsubj", "cells", "protect"],
["chpt38_1", "sent. 1", "fred", "nsubj", "animals", "protect"],
["chpt54_1", "sent. 1", "greg", "nsubj", "uticle", "protect"]
);
Current sort
@all_matches = sort {lc($a->[4]) cmp lc($b->[4])} @all_matches;
my ($last_word, $current_word, $word_count);
for my $j (0 .. $#all_matches) {
$current_word = $all_matches[$j][4];
if (lc($last_word) eq lc($current_word)) {
$word_count++;
}
else {
if ($j != 0)
{
for (my $k = 1; $k <= $word_count; $k++)
{
$all_matches[($j-$k)][6] = $word_count;
}
}
$last_word = $current_word;
$word_count = 1;
}
}
@all_matches = sort {$b->[6] <=> $a->[6] || lc($a->[4]) cmp lc($b->[4])} @all_matches;
Problem: when all_matches is passed in, column 6 is already set to 1! I do this because the count ($match->[6]) is sometimes blank.
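One plausible explanation for the blank counts, assuming the loop under "Current sort" is the complete code: a group's count is written back only when the next, different word is encountered, so the last group in sorted order never receives one. Below is a minimal sketch of the same run-length idea with a final flush. The helper name annotate_counts and the shortened rows are hypothetical, not from the original.

```perl
use strict;
use warnings;

# Hypothetical helper: expects rows already sorted by column 4 and
# stores each group's size in column 6 of every row in that group.
sub annotate_counts {
    my ($rows) = @_;
    my $start = 0;                    # index where the current group begins
    for my $j (1 .. @$rows) {         # deliberately runs one past the last index
        next if $j < @$rows
            && lc($rows->[$j][4]) eq lc($rows->[$start][4]);
        # Word changed, or the data ended: flush the finished group.
        $_->[6] = $j - $start for @{$rows}[ $start .. $j - 1 ];
        $start = $j;
    }
    return $rows;
}

# Made-up rows; only column 4 matters for this demonstration.
my @rows = map { [ ('') x 4, $_, 'protect' ] }
    qw(animals animals cells cells cells uticle uticle);
annotate_counts(\@rows);
print join(' ', map { "$_->[4]=$_->[6]" } @rows), "\n";
# prints: animals=2 animals=2 cells=3 cells=3 cells=3 uticle=2 uticle=2
```

Because the loop index runs one step past the end of the array, the final group is flushed by the same code path as every other group.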
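Bonus? Count the number of times the last two columns appear together (right now I'm fairly sure it only checks the second-to-last column). In this test case the last column is identical in every row; in the real data the words in the last column have different suffixes (i.e. variants of protect).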
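For the bonus, one possible approach is to count on a composite key built from both columns. This is only a sketch: the rows below are made up for illustration (including the "protects" variant), the pair_key helper is hypothetical, and differently suffixed words are treated as different values. Grouping them together would need some form of stemming, which is not shown.

```perl
use strict;
use warnings;

# Illustrative rows only; columns 4 and 5 form the pair being counted.
my @all_matches = (
    [ "chpt10_2", "sent. 2",  "alice", "nsubj", "animals", "protect"  ],
    [ "chpt12_1", "sent. 54", "bob",   "nsubj", "cells",   "protects" ],
    [ "chpt34_1", "sent. 1",  "dave",  "nsubj", "cells",   "protect"  ],
    [ "chpt35_1", "sent. 2",  "eli",   "nsubj", "Cells",   "protect"  ],
);

# Hypothetical helper: lower-cased columns 4 and 5 joined with a NUL byte,
# which is assumed never to occur in the data.
sub pair_key { return join "\0", map { lc $_ } @{ $_[0] }[ 4, 5 ] }

my %counts;
$counts{ pair_key($_) }++ for @all_matches;

# Most frequent pair first; equal counts fall back to the pair itself.
@all_matches = sort {
    $counts{ pair_key($b) } <=> $counts{ pair_key($a) }
        || pair_key($a) cmp pair_key($b)
} @all_matches;

print join(" ", @$_), "\n" for @all_matches;
```

The two cells/protect rows come first because that pair occurs twice; the remaining pairs each occur once and are ordered alphabetically.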
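Thank you very much for your time. I tried using a hash and thought it worked, but it missed some things.
Here is my hash attempt. I can't tell you why it doesn't work: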
my %freq;
foreach ( map{$_->[4]}@results) #feeds in list of animals, cells, uticle, etc.
{
$freq{lc $_}++;
}
@results = sort {$freq{lc $b->[4]} <=> $freq{lc $a->[4]} #freq order
or
$a->[0] cmp $b->[0] #text col 0
} @results;
Answer 0 (score: 7)
Why not build a hash that maps each key to its number of occurrences, and use:
my %counts;
foreach my $rowref (@all_matches)
{
$counts{lc($rowref->[4])}++;
}
@all_matches = sort { $counts{lc($b->[4])} <=> $counts{lc($a->[4])} ||
lc($a->[4]) cmp lc($b->[4])
} @all_matches;
Tested:
#!/usr/bin/env perl
use strict;
use warnings;
my @all_matches = (
["chpt10_2", "sent. 2", "alice", "nsubj", "animals", "protect"],
["chpt12_1", "sent. 54", "bob", "nsubj", "cells", "protect"],
["chpt25_4", "sent. 47", "carol", "nsubj", "plants", "protect"],
["chpt34_1", "sent. 1", "dave", "nsubj", "cells", "protect"],
["chpt35_1", "sent. 2", "eli", "nsubj", "cells", "protect"],
["chpt38_1", "sent. 1", "fred", "nsubj", "animals", "protect"],
["chpt54_1", "sent. 1", "greg", "nsubj", "uticle", "protect"]
);
my %counts;
foreach my $rowref (@all_matches)
{
$counts{lc($rowref->[4])}++;
}
@all_matches = sort { $counts{lc($b->[4])} <=> $counts{lc($a->[4])} ||
lc($a->[4]) cmp lc($b->[4])
} @all_matches;
my $i = 0;
foreach my $rowref (@all_matches)
{
$i++;
print "$i";
print " $_" foreach (@$rowref);
print "\n";
}
Output:
1 chpt12_1 sent. 54 bob nsubj cells protect
2 chpt34_1 sent. 1 dave nsubj cells protect
3 chpt35_1 sent. 2 eli nsubj cells protect
4 chpt10_2 sent. 2 alice nsubj animals protect
5 chpt38_1 sent. 1 fred nsubj animals protect
6 chpt25_4 sent. 47 carol nsubj plants protect
7 chpt54_1 sent. 1 greg nsubj uticle protect
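As noted in the comments, the lc operations are not needed for the data shown; removing them would improve performance, as would adding a case-converted key to each array.
Using lc once per row (note the altered data values):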
#!/usr/bin/env perl
use strict;
use warnings;
my @all_matches = (
[ "chpt10_2", "sent. 2", "alice", "nsubj", "animAls", "protect" ],
[ "chpt12_1", "sent. 54", "bob", "nsubj", "celLs", "protect" ],
[ "chpt25_4", "sent. 47", "carol", "nsubj", "plAnts", "protect" ],
[ "chpt34_1", "sent. 1", "dave", "nsubj", "cElls", "protect" ],
[ "chpt35_1", "sent. 2", "eli", "nsubj", "cells", "protect" ],
[ "chpt38_1", "sent. 1", "fred", "nsubj", "Animals", "protect" ],
[ "chpt54_1", "sent. 1", "greg", "nsubj", "uticle", "protect" ],
);
my %counts;
foreach my $rowref (@all_matches)
{
push @$rowref, lc($rowref->[4]);
$counts{$rowref->[6]}++;
}
@all_matches = sort { $counts{$b->[6]} <=> $counts{$a->[6]} || $a->[6] cmp $b->[6]
} @all_matches;
my $i = 0;
foreach my $rowref (@all_matches)
{
$i++;
print "$i";
printf " %-9s", $_ foreach (@$rowref);
print "\n";
}
Output:
1 chpt12_1 sent. 54 bob nsubj celLs protect cells
2 chpt34_1 sent. 1 dave nsubj cElls protect cells
3 chpt35_1 sent. 2 eli nsubj cells protect cells
4 chpt10_2 sent. 2 alice nsubj animAls protect animals
5 chpt38_1 sent. 1 fred nsubj Animals protect animals
6 chpt25_4 sent. 47 carol nsubj plAnts protect plants
7 chpt54_1 sent. 1 greg nsubj uticle protect uticle
Answer 1 (score: 1)
Try this:
my @all_matches = (["chpt10_2", "sent. 2", "alice", "nsubj", "animals", "protect"],
["chpt12_1", "sent. 54", "bob", "nsubj", "cells", "protect"],
["chpt25_4", "sent. 47", "carol", "nsubj", "plants", "protect"],
["chpt34_1", "sent. 1", "dave", "nsubj", "cells", "protect"],
["chpt35_1", "sent. 2", "eli", "nsubj", "cells", "protect"],
["chpt38_1", "sent. 1", "fred", "nsubj", "animals", "protect"],
["chpt54_1", "sent. 1", "greg", "nsubj", "uticle", "protect"]
);
my %wordcount;
foreach my $row (@all_matches) {
$wordcount{$row->[4]}++;
}
my @sorted = sort { $wordcount{$b->[4]} <=> $wordcount{$a->[4]} } @all_matches;
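Note that this comparator orders by count alone, so when two different words occur equally often their rows are not necessarily grouped together. Here is a sketch of adding the word as a secondary key. The rows are made up so that two words tie, and the word sits in column 1 here rather than column 4.

```perl
use strict;
use warnings;

# Made-up rows: "cells" and "plants" both occur twice and are interleaved.
my @all_matches = (
    [ "r1", "cells"  ],
    [ "r2", "plants" ],
    [ "r3", "cells"  ],
    [ "r4", "plants" ],
    [ "r5", "uticle" ],
);

my %wordcount;
$wordcount{ $_->[1] }++ for @all_matches;

# Most frequent first; equal counts fall back to alphabetical order of the
# word, which keeps rows for the same word adjacent.
my @sorted = sort {
    $wordcount{ $b->[1] } <=> $wordcount{ $a->[1] }
        || $a->[1] cmp $b->[1]
} @all_matches;

print join(" ", map { $_->[1] } @sorted), "\n";
# prints: cells cells plants plants uticle
```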