How can I do this in a more Perl-like way?

Date: 2011-10-09 12:34:54

Tags: perl

I am new to Perl, and for one of my assignments I came up with this solution:

#wordcount.pl FILE 
    # 

    #if no filename is given, print help and exit 
    if (length($ARGV[0]) < 1) 
    { 
           print "Usage is : words.pl word filename\n"; 
           exit; 
    } 

   my $file = $ARGV[0];          #filename given in commandline 

   open(FILE, $file);            #open the mentioned filename 
   while(<FILE>)                 #continue reading until the file ends 
    { 
           chomp; 
           tr/A-Z/a-z/;          #convert all upper case words to lower case 
           tr/.,:;!?"(){}//d;            #remove some common punctuation symbols 
           #We are creating a hash with the word as the key.  
           #Each time a word is encountered, its hash is incremented by 1. 
           #If the count for a word is 1, it is a new distinct word. 
           #We keep track of the number of words parsed so far. 
           #We also keep track of the no. of words of a particular length.  

          foreach $wd (split) 
          { 
                $count{$wd}++; 
                if ($count{$wd} == 1) 
                 { 
                       $dcount++; 
                 } 
                $wcount++; 
                $lcount{length($wd)}++; 
          } 
   } 

   #To print the distinct words and their frequency,  
   #we iterate over the hash containing the words and their count. 
   print "\nThe words and their frequency in the text is:\n"; 
   foreach $w (sort keys%count) 
   { 
         print "$w : $count{$w}\n"; 
   } 

   #For the word length and frequency we use the word length hash 
   print "The word length and frequency in the given text is:\n"; 
   foreach $w (sort keys%lcount) 
   { 
         print "$w : $lcount{$w}\n"; 
   } 

   print "There are $wcount words in the file.\n"; 
   print "There are $dcount distinct words in the file.\n"; 

   $ttratio = ($dcount/$wcount)*100;       #Calculating the type-token ratio. 

   print "The type-token ratio of the file is $ttratio.\n"; 

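I have included comments in the code. What I actually have to do is find the word count of a given text file. The output of the program above looks like this: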

The words and their frequency in the text is: 
1949 : 1
a : 1
adopt : 1
all : 2
among : 1
and : 8
assembly : 1
assuring : 1
belief : 1
citizens : 1
constituent : 1
constitute : 1
.
.
.
The word length and frequency in the given text is:
1 : 1
10 : 5
11 : 2
12 : 2
2 : 15
3 : 18
There are 85 words in the file. 
There are 61 distinct words in the file. 
The type-token ratio of the file is 71.7647058823529. 

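With some help from Google I was able to find a solution for my assignment. But I think that using the real power of Perl would give a much smaller, more concise program. Can anyone give me a Perl solution with fewer lines of code?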

2 answers:

Answer 0: (score: 9)

Here are some suggestions:

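  • Include use strict and use warnings in your Perl scripts.

  • Your argument validation does not test what it should: (1) whether @ARGV contains exactly one item, and (2) whether that item is a valid file name.

  • Although every rule has exceptions, it is generally good practice to assign the return value of <> to a named variable rather than relying on $_. This is especially true if the code inside the loop might need one of Perl's constructs that also relies on $_ (for example map, grep, or a postfix for loop).

    while (my $line = <>){ ... }

  • Perl provides a built-in function (lc) for lowercasing strings.

  • You are performing unnecessary computations inside the line-reading loop. If you simply build up a tally of words, you will have all the information you need. Also note that Perl provides a one-line form for most of its control structures (for, while, if, etc.), as shown below.

    while (my $line = <>){
        ...
        $words{$_} ++ for split /\s+/, $line;
    }

  • You can then use the word tallies to compute the other information you need. For example, the number of unique words is simply the number of keys in the hash, and the total number of words is the sum of the hash values.

  • The distribution of word lengths can be computed from the same tallies, by adding each word's count to the entry for its length.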

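The suggestions above can be combined into one short program. What follows is only a sketch, not the answerer's own code: the sub names check_args, tally_words and summarize are made up for illustration, the punctuation set is copied from the question, and the word-length line is one possible way to derive the distribution from the tallies. It runs on in-memory sample lines so that it is self-contained.

```perl
use strict;
use warnings;

# Return an error message unless the arguments are exactly one readable
# file; return nothing when they are fine. (Hypothetical helper.)
sub check_args {
    my @args = @_;
    return "Usage: $0 FILE" unless @args == 1;
    return "Cannot read file '$args[0]'" unless -f $args[0] && -r $args[0];
    return;
}

# Build the word tallies; everything else is derived from them.
sub tally_words {
    my @lines = @_;
    my %words;
    for my $line (@lines) {
        my $text = lc $line;               # built-in lowercasing
        $text =~ tr/.,:;!?"(){}//d;        # punctuation set from the question
        $words{$_}++ for split ' ', $text; # ' ' also skips leading whitespace
    }
    return \%words;
}

# Derive the remaining figures from the tallies alone.
sub summarize {
    my ($words) = @_;
    my $distinct = keys %$words;           # number of keys = distinct words
    my $total = 0;
    $total += $_ for values %$words;       # sum of values = total words
    my %lengths;
    $lengths{ length $_ } += $words->{$_} for keys %$words;
    my $ratio = $total ? 100 * $distinct / $total : 0;
    return ($distinct, $total, \%lengths, $ratio);
}

# Demo on in-memory lines; a real script would first call
# check_args(@ARGV) and then read the lines from the file.
my $words = tally_words("The cat and the hat.", "A cat!");
my ($distinct, $total, $lengths, $ratio) = summarize($words);

print "$_ : $words->{$_}\n" for sort keys %$words;
# Sort lengths numerically so that 10 comes after 2, not before it.
print "$_ : $lengths->{$_}\n" for sort { $a <=> $b } keys %$lengths;
printf "%d words, %d distinct, type-token ratio %.1f\n",
    $total, $distinct, $ratio;
```

The numeric sort matters for the length table: the plain sort in the question compares keys as strings, which is why its output lists 10, 11 and 12 before 2 and 3.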
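Answer 1: (score: 1)

Using a hash the way you did is a good approach. A more Perl-like way to parse the file is to use a regular expression with the /g flag to read the words from each line. \w+ matches one or more word characters (letters, digits, and the underscore).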

while( <FILE> )
{
    while( /(\w+)/g )
    {
        my $wd = lc( $1 );
        ...
    }
}
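
A complete version of this loop might look like the sketch below. The sub name count_words and the sample lines are assumptions made for illustration; the original answer leaves the loop body elided. Note that because \w does not match an apostrophe, a word such as "Don't" is counted as two words, "don" and "t".

```perl
use strict;
use warnings;

# Count words by matching them directly instead of splitting the line.
# (Hypothetical helper name.)
sub count_words {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        # In scalar context, /g resumes after the previous match,
        # so the loop visits every word on the line in turn.
        while ($line =~ /(\w+)/g) {
            my $wd = lc($1);
            $count{$wd}++;
        }
    }
    return \%count;
}

my $count = count_words("Don't panic.", "PANIC later");
print "$_ : $count->{$_}\n" for sort keys %$count;
```

With this approach punctuation never reaches the hash, so the separate tr///d step from the question is no longer needed.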