Top 100 most used strings in a file

Time: 2012-03-31 15:22:46

Tags: perl file sorting hash word

How can I use Perl to find the 100 most frequently used strings (words) in a .txt file? So far I have the following:

use 5.012;
use warnings;

open(my $file, "<", "file.txt") or die "Cannot open file.txt: $!";

my %word_count;
while (my $line = <$file>) {
  foreach my $word (split ' ', $line) {
     $word_count{$word}++;
  } 
} 

for my $word (sort keys %word_count) {
  print "'$word': $word_count{$word}\n";
}

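But this only counts each word and lists the results in alphabetical order. I want the 100 most frequently used words in the file, sorted by number of occurrences. Any ideas?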

Related: Count number of times string repeated in files perl

1 answer:

Answer 0 (score: 8)

From reading the fine perlfaq4(1) manpage, one learns how to sort hashes by value. So try this. It is also more idiomatically "perlian" than your approach.

#!/usr/bin/env perl
use v5.12;
use strict;
use warnings;
use warnings FATAL => "utf8";
use open qw(:utf8 :std);

my %seen;
while (<>) {
    $seen{$_}++ for split /\W+/;  # or just split;
}

my $count = 0;
for (sort {
        $seen{$b} <=> $seen{$a}
                  ||
           lc($a) cmp lc($b)    # XXX: should be v5.16's fc() instead
                  ||
              $a  cmp  $b
     } keys %seen)
{
    next unless /\w/;
    printf "%-20s %5d\n", $_, $seen{$_};
    last if ++$count >= 100;
}

When run against itself, the first 10 lines of output are:

seen                     6
use                      5
_                        3
a                        3
b                        3
cmp                      2
count                    2
for                      2
lc                       2
my                       2
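The XXX comment in the code above suggests using v5.16's fc() for the case-insensitive tie-break. A minimal sketch of that variant follows, assuming Perl 5.16 or later. The helper name top_words and the sample lines are hypothetical and not part of the original answer; the sort order is the same three-level comparison, with fc() swapped in for lc().

```perl
#!/usr/bin/env perl
use v5.16;      # enables strict and the fc() feature
use warnings;

# Hypothetical helper (not from the answer above): returns up to $limit
# [word, count] pairs, most frequent first.
sub top_words {
    my ($limit, @lines) = @_;

    my %seen;
    for my $line (@lines) {
        # split /\W+/ yields an empty leading field when a line starts
        # with a non-word character, so drop empty strings here.
        $seen{$_}++ for grep { length } split /\W+/, $line;
    }

    my @sorted = sort {
        $seen{$b} <=> $seen{$a}     # higher counts first
                  ||
           fc($a) cmp fc($b)        # then case-insensitively
                  ||
              $a  cmp  $b           # then by exact spelling
    } keys %seen;

    # Keep at most $limit entries.
    splice @sorted, $limit if @sorted > $limit;
    return map { [ $_, $seen{$_} ] } @sorted;
}

# Illustrative sample input; pass <> instead to read files or STDIN.
my @sample = ("the cat and the hat\n", "The cat sat\n");
printf "%-20s %5d\n", @$_ for top_words(3, @sample);
```

With the sample lines this lists cat (2), the (2) and and (1), in that order. Collecting the results in a list before printing also makes the limit explicit, so there is no counter to get off by one.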