How to index all unique words in a corpus using Perl

Asked: 2012-04-13 19:31:39

Tags: perl indexing

.i 1
.t
 effici machineindepend procedur 
garbag collect  variou list structur
.w
 method  return regist   free
list   essenti part   list process
system.   paper past solut   recoveri
problem  review  compar.  new algorithm
 present  offer signific advantag  speed
 storag util.  routin  implement
 algorithm   written   list languag 
       insur  degre
 machin independ. final  applic  
algorithm   number  differ list structur
appear   literatur  indic.
.b
cacm august 1967
.a
schorr h.
wait w. m.
.n
ca670806 jb februari 27 1978 428 pm
.x
1024 4 1549

1024 4 1549

1050 4 1549

.i 2
.t
 comparison  batch process  instant turnaround
.w
 studi   program effort  student
  introductori program cours  present
  effect  have instant turnaround   minut
 oppos  convent batch process
 turnaround time    hour  examin. 
 item compar   number  comput
run  trip   comput center program prepar
time keypunch time debug time
number  run  elaps time    run
   run   problem.   
result  influenc   fact  bonu point
 given  complet   program problem
    specifi number  run 
 evid  support instant  batch.
.b
cacm august 1967
.a
smith l. b.
.n
ca670805 jb februari 27 1978 432 pm
.x
1550 4 1550

1550 4 1550

1304 5 1550

1472 5 1550

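The text above is the content of two documents; both have had stop words removed and have been stemmed. A new document starts with .i (followed by a number). I need to index the words in the text between .t and .b, .b and .a, .a and .n, and .n and .x, and ignore all text between .x and the start of the next document, i.e. the next .i (followed by a number).

The contents of all documents are stored in a single file, e.g. 'corpus'. I need an index of all unique words, together with the number of times each word occurs in the corpus and in each document, and if possible the position at which it occurs in the document.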

use strict;
use warnings;

my $file = 'sometext.txt';
open FILE, '<', $file or die $!;
my @texts = <FILE>;
foreach my $text (@texts) {
    my @lines = split("\n", $text);
    foreach my $line (@lines) {
        my @words = split(" ", $line);     # split the current line, not $text
        foreach my $word (@words) {
            $word = trim($word);
            my $match = qr/\Q$word\E/i;    # \Q...\E keeps "." and friends literal

            # Re-read the whole file once per word (slow, but it is what this attempt does)
            open STFILE, '<', $file or die $!;
            my $count = 0;

            while (<STFILE>) {
                if ($_ =~ $match) {
                    my @mword = split /\s+/, $_;
                    for my $i (0..$#mword) {
                        if ($mword[$i] =~ $match) {
                            #print "match found on line $. word ", $i+1,"\n";
                            $count++;
                        }
                    }
                }
            }
            print "$word appears $count times \n";
            close(STFILE) or die "Couldn't close $file: $!\n\n";
        }
    }
}

close(FILE) or die "Couldn't close $file: $!\n\n";

sub trim {
    my $string = shift;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    return $string;
}

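The code above counts the occurrences of each word in the whole corpus. How can I change it so that it also counts the occurrences of each word within each individual document?

2 Answers:

Answer 0: (score: 2)

How about:

Edited to add a separate counter for each document: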

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $words;
my $doc;
my $file = 'path/to/file';
open my $fh, '<', $file or die "unable to open '$file' for reading:$!";
while(<$fh>) {
    chomp;
    $doc = $_ if /^\.i/;
    next if (/^\.x\b/ .. /^\.i\b/);
    next if /^\./;
    my @words = split;
    for(@words) {
        $words->{$_}{$doc}++;
    }
}
close $fh;
print Dumper $words;
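
The hash built above is keyed by word and then by document (the `.i` line), so it holds per-document counts but no corpus-wide total. A minimal sketch of how the totals could be derived from it follows; `corpus_totals` is a hypothetical helper name, and the sample counts are made up for illustration rather than computed from the corpus shown in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the answer above): sums the per-document
# counts of a { word => { doc => count } } hash into corpus-wide totals.
sub corpus_totals {
    my ($words) = @_;
    my %total;
    for my $word (keys %$words) {
        $total{$word} += $_ for values %{ $words->{$word} };
    }
    return \%total;
}

# Illustrative data in the same shape as $words above (counts are invented).
my $words = {
    list => { '.i 1' => 2 },
    run  => { '.i 1' => 1, '.i 2' => 3 },
};

my $total = corpus_totals($words);
for my $word (sort keys %$total) {
    printf "%-6s total=%d docs=%d\n",
        $word, $total->{$word}, scalar(keys %{ $words->{$word} });
}
```

Keeping the per-document hash as the single source of truth and deriving totals from it avoids maintaining two counters that could drift apart.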

Answer 1: (score: 1)

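Use a hash whose values hold the current count for each word. Loop over all lines and all words. Use a dumb state machine (based on a flag variable) to ignore the text between .t and .b.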
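
The flag-variable approach might look roughly like the sketch below. It is only one possible reading: it assumes the skip rule stated in the question (ignore everything from `.x` up to the next `.i`), so adjust the two marker tests if a different section should be skipped. The name `build_index` and the layout of the returned hash are assumptions made for this example. Besides counts, it records the word positions within each document, which the question also asks for.

```perl
use strict;
use warnings;

# Hypothetical helper: builds an index from a list of corpus lines.
# Returns { word => { total => N, docs => { doc_id => [positions] } } }.
sub build_index {
    my @lines = @_;
    my (%index, $doc, $pos);
    my $skip = 1;    # flag: true while we are outside indexable text

    for my $line (@lines) {
        if ($line =~ /^\.i\s+(\d+)/) {    # a new document starts here
            ($doc, $pos, $skip) = ($1, 0, 0);
            next;
        }
        if ($line =~ /^\.x\b/) {          # stop indexing until the next .i
            $skip = 1;
            next;
        }
        next if $skip or $line =~ /^\.[a-z]\s*$/;    # skipped text, or a bare section marker

        for my $word (split ' ', $line) {
            $pos++;                       # 1-based word position in the document
            $index{$word}{total}++;
            push @{ $index{$word}{docs}{$doc} }, $pos;
        }
    }
    return \%index;
}

# Small made-up sample in the corpus format; with a real file, pass <$fh> instead.
my @sample = (
    ".i 1", ".t", "list structur", ".w", "list process system", ".x", "1024 4 1549",
    ".i 2", ".t", "batch process", ".x", "1550 4 1550",
);
my $index = build_index(@sample);
for my $word (sort keys %$index) {
    my $docs = $index->{$word}{docs};
    print "$word: total $index->{$word}{total}; ",
        join('; ', map { "doc $_ at @{ $docs->{$_} }" } sort { $a <=> $b } keys %$docs),
        "\n";
}
```

The per-document count of a word is simply the length of its position list, so no separate counter is needed. Tokens keep their trailing punctuation (for example `system.`); strip it before indexing if that is not wanted.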

If you have trouble writing any of the code described above, please post a specific question about the problem you are running into.