I am new to Perl, and for one of my homework assignments I came up with this solution:
#wordcount.pl FILE
#
#if no filename is given, print help and exit
if (length($ARGV[0]) < 1)
{
    print "Usage is : words.pl word filename\n";
    exit;
}
my $file = $ARGV[0]; #filename given in commandline
open(FILE, $file); #open the mentioned filename
while(<FILE>) #continue reading until the file ends
{
    chomp;
    tr/A-Z/a-z/; #convert all upper case words to lower case
    tr/.,:;!?"(){}//d; #remove some common punctuation symbols
    #We are creating a hash with the word as the key.
    #Each time a word is encountered, its hash is incremented by 1.
    #If the count for a word is 1, it is a new distinct word.
    #We keep track of the number of words parsed so far.
    #We also keep track of the no. of words of a particular length.
    foreach $wd (split)
    {
        $count{$wd}++;
        if ($count{$wd} == 1)
        {
            $dcount++;
        }
        $wcount++;
        $lcount{length($wd)}++;
    }
}
#To print the distinct words and their frequency,
#we iterate over the hash containing the words and their count.
print "\nThe words and their frequency in the text is:\n";
foreach $w (sort keys%count)
{
    print "$w : $count{$w}\n";
}
#For the word length and frequency we use the word length hash
print "The word length and frequency in the given text is:\n";
foreach $w (sort keys%lcount)
{
    print "$w : $lcount{$w}\n";
}
print "There are $wcount words in the file.\n";
print "There are $dcount distinct words in the file.\n";
$ttratio = ($dcount/$wcount)*100; #Calculating the type-token ratio.
print "The type-token ratio of the file is $ttratio.\n";
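I have included comments in the code. What I actually need to do is count the words in a given text file. The program above produces the following output: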
The words and their frequency in the text is:
1949 : 1
a : 1
adopt : 1
all : 2
among : 1
and : 8
assembly : 1
assuring : 1
belief : 1
citizens : 1
constituent : 1
constitute : 1
.
.
.
The word length and frequency in the given text is:
1 : 1
10 : 5
11 : 2
12 : 2
2 : 15
3 : 18
There are 85 words in the file.
There are 61 distinct words in the file.
The type-token ratio of the file is 71.7647058823529.
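I was able to find a solution to my homework with the help of Google, but I think that using the real power of Perl would give smaller, more concise code. Can anyone give me a Perl solution with fewer lines of code?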
Answer 0 (score: 9)
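Here are a few suggestions:
Include use strict and use warnings in your Perl scripts.
Your argument validation does not test what it should test: (1) whether @ARGV contains exactly one item, and (2) whether that item is a valid file name.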
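For the argument check, a minimal sketch might look like the following. The helper name validate_args is hypothetical (it does not come from the question), and "valid file name" is assumed here to mean an existing, readable plain file.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns the file name
# when exactly one argument was given and it names an existing, readable
# plain file; returns nothing (undef in scalar context) otherwise.
sub validate_args {
    my @args = @_;
    return unless @args == 1;               # exactly one argument expected
    my $file = $args[0];
    return unless -f $file && -r $file;     # plain file that we can read
    return $file;
}

# Possible use at the top of a script:
#   defined(my $file = validate_args(@ARGV)) or die "Usage: $0 FILE\n";
```

Keeping the check in a small sub makes it easy to try out separately from the counting code.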
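Although every rule has exceptions, it is generally good practice to assign the return value of <> to a named variable rather than relying on $_. This is especially true if the code inside the loop might need one of Perl's constructs that also relies on $_ (for example, map, grep, or a postfix for loop).
while (my $line = <>){
    ...
}
Perl provides a built-in function (lc) to lowercase strings.
You are performing unnecessary computations inside the line-reading loop. If you simply build up a tally of words, you will have all the information you need. Also note that Perl provides a one-line form for most of its control structures (for, while, if, etc.), as shown here:
while (my $line = <>){
    ...
    $words{$_} ++ for split /\s+/, $line;
}
You can then use the word tallies to compute the other information you need. For example, the number of unique words is simply the number of keys in the hash, and the total number of words is the sum of the hash values.
The distribution of word lengths can be derived from the same tallies, by adding each word's count to a running total for its length.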
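As a rough sketch of how the other figures could be derived, assume the tallies live in a hash shaped like %words (word as key, count as value). The helper name summarize and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: derives summary figures from a word => count hash.
sub summarize {
    my ($words) = @_;                  # hash reference: word => count
    my $distinct = keys %$words;       # number of keys = distinct words
    my $total = 0;
    $total += $_ for values %$words;   # sum of the values = total words
    my %lengths;                       # word length => number of words
    $lengths{ length $_ } += $words->{$_} for keys %$words;
    return ($distinct, $total, \%lengths);
}

# Sample tallies, invented for this example.
my %words = (the => 3, cat => 1, sat => 1, quietly => 2);
my ($distinct, $total, $lengths) = summarize(\%words);
print "$distinct distinct words, $total words in total\n";
# Sort the lengths numerically; a plain sort compares them as strings.
print "$_ : $lengths->{$_}\n" for sort { $a <=> $b } keys %$lengths;
```

The numeric sort matters for the length table: with a plain sort, a length of 10 is listed before a length of 2, which is what the output in the question shows.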
Answer 1 (score: 1)
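Using a hash the way you do is a good approach. A more Perlish way to parse the file is to use a regular expression with the /g flag to read the words from each line. \w+ matches one or more word characters (letters, digits, or underscore).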
while( <FILE> )
{
    while( /(\w+)/g )
    {
        my $wd = lc( $1 );
        ...
    }
}
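One way the counting could go is sketched below, under a few assumptions: the lines are already in memory, the tally goes into a hash, and the same /g match is used in list context instead of the nested while. The name count_words is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: tallies lower-cased words from a list of lines.
sub count_words {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        # In list context, /(\w+)/g returns every captured word on the line.
        $count{ lc $_ }++ for $line =~ /(\w+)/g;
    }
    return \%count;
}

# Sample input, invented for this example.
my $count = count_words("The cat sat.", "The END, the end!");
print "$_ : $count->{$_}\n" for sort keys %$count;
```

Because punctuation never matches \w, no separate tr/// step is needed to strip it. One side effect is that an apostrophe splits a word in two, so "don't" is counted as "don" and "t".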