.i 1
.t
effici machineindepend procedur
garbag collect variou list structur
.w
method return regist free
list essenti part list process
system. paper past solut recoveri
problem review compar. new algorithm
present offer signific advantag speed
storag util. routin implement
algorithm written list languag
insur degre
machin independ. final applic
algorithm number differ list structur
appear literatur indic.
.b
cacm august 1967
.a
schorr h.
wait w. m.
.n
ca670806 jb februari 27 1978 428 pm
.x
1024 4 1549
1024 4 1549
1050 4 1549
.i 2
.t
comparison batch process instant turnaround
.w
studi program effort student
introductori program cours present
effect have instant turnaround minut
oppos convent batch process
turnaround time hour examin.
item compar number comput
run trip comput center program prepar
time keypunch time debug time
number run elaps time run
run problem.
result influenc fact bonu point
given complet program problem
specifi number run
evid support instant batch.
.b
cacm august 1967
.a
smith l. b.
.n
ca670805 jb februari 27 1978 432 pm
.x
1550 4 1550
1550 4 1550
1304 5 1550
1472 5 1550
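The text above is the content of 2 documents; both have already been stopped and stemmed. A new document starts with .i (followed by a number). I need to index the words in the text between .t & .b, .b & .a, .a & .n, and .n & .x, and ignore all text between .x and the start of the next document, i.e. .i (followed by a number).
All documents are stored in a single file, e.g. 'corpus'. I need to build an index of all unique words together with the number of times each one appears in the corpus and in each document, and possibly where in the document it appears.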
open FILE, '<', 'sometext.txt' or die $!;
my @texts = <FILE>;
foreach my $text(@texts) {
    my @lines = split ("\n",$text);
    foreach my $line(@lines) {
        my @words = split (" ",$text);
        foreach my $word(@words) {
            $word = trim($word);
            my $match = qr/$word/i;
            open STFILE, '<', 'sometext.txt' or die $!;
            my $count=0;
            while (<STFILE>) {
                if ($_ =~ $match) {
                    my @mword = split /\s+/, $_;
                    $_ =~ s/[A-Za-z0-9_ ]//g;
                    for my $i (0..$#mword) {
                        if ($mword[$i] =~ $match) {
                            #print "match found on line $. word ", $i+1,"\n";
                            $count++
                        }
                    }
                }
            }
            print "$word appears $count times \n";
            close(STFILE) or die "Couldn't close $file: $!\n\n";
        }
    }
}
close(FILE) or die "Couldn't close $file: $!\n\n";
sub trim($)
{
    my $string = shift;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    return $string;
}
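The code above counts the occurrences of each word in the whole corpus. How can I change it so that it also counts the occurrences of each word within each individual document?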
Answer 0 (score: 2)
How about the following?
Edited to add a separate counter for each document:
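#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $words;    # $words->{word}{document marker} = count
my $doc;      # marker line of the current document, e.g. ".i 1"
my $file = 'path/to/file';
open my $fh, '<', $file or die "unable to open '$file' for reading: $!";
while (<$fh>) {
    chomp;
    $doc = $_ if /^\.i/;               # remember which document we are in
    next if (/^\.x\b/ .. /^\.i\b/);    # skip from .x through the next .i line
    next if /^\./;                     # skip the remaining marker lines
    my @words = split;
    for (@words) {
        $words->{$_}{$doc}++;          # one counter per word and document
    }
}
close $fh;
print Dumper $words;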
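The question also asks for corpus-wide totals and for the position of each word inside its document, which the per-document counter alone does not record. A minimal sketch of one way to extend the same range-operator approach is shown here. The sub name `index_corpus`, the layout of the returned hash, and the inline sample are illustrative assumptions rather than part of the answer, and positions are taken to be 1-based word offsets within each document's indexed text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): reads a corpus from
# a filehandle and returns a hash reference shaped as
#   $index->{$word}{total}             corpus-wide count
#   $index->{$word}{docs}{$id}{count}  count within document $id
#   $index->{$word}{docs}{$id}{pos}    1-based word positions in document $id
sub index_corpus {
    my ($fh) = @_;
    my (%index, $doc, $pos);
    while (my $line = <$fh>) {
        chomp $line;
        # A ".i <number>" line starts a new document; restart the position counter.
        ($doc, $pos) = ($1, 0) if $line =~ /^\.i\s+(\d+)/;
        # Skip everything from ".x" up to and including the next ".i" line.
        next if ($line =~ /^\.x\b/) .. ($line =~ /^\.i\b/);
        # Skip the remaining marker lines (.i, .t, .w, .b, .a, .n).
        next if $line =~ /^\./;
        # Ignore any text that appears before the first document marker.
        next unless defined $doc;
        for my $word (split ' ', $line) {
            $pos++;
            my $entry = $index{$word} //= { total => 0, docs => {} };
            $entry->{total}++;
            $entry->{docs}{$doc}{count}++;
            push @{ $entry->{docs}{$doc}{pos} }, $pos;
        }
    }
    return \%index;
}

# Small made-up sample in the corpus format, read through an in-memory filehandle.
my $sample = join "\n",
    '.i 1', '.t', 'garbag collect list structur',
    '.w', 'list process system list',
    '.x', '1024 4 1549',
    '.i 2', '.t', 'batch process',
    '.w', 'batch list',
    '.x', '1550 4 1550', '';
open my $fh, '<', \$sample or die "cannot open in-memory sample: $!";
my $index = index_corpus($fh);
close $fh;

# Report totals first, then the per-document count and positions.
for my $word (sort keys %$index) {
    my $entry = $index->{$word};
    print "$word: total $entry->{total}\n";
    for my $id (sort { $a <=> $b } keys %{ $entry->{docs} }) {
        my $d = $entry->{docs}{$id};
        print "  doc $id: $d->{count} at positions @{ $d->{pos} }\n";
    }
}
```

Tokens are split on whitespace only, so a word with trailing punctuation in the corpus (for example `system.`) would be counted separately from `system` unless the punctuation is stripped first.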
Answer 1 (score: 1)
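Use a hash whose values hold the current count for each word. Loop over all lines and all words. Use a dumb (flag-variable based) state machine to ignore the text between .t and .b.
If you have trouble writing the code for any of the above, please post a specific question about the part you are stuck on.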
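The flag-based state machine described in this answer could be sketched roughly as follows. This is a sketch under stated assumptions, not the answerer's own code: the sub name `count_words` and the sample lines are made up for illustration, and the ignored span is assumed to be the `.x` block that the question asks to skip. If a different span should be ignored, only the marker tests that set and clear the flag need to change.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes the corpus as a list of lines and returns
# two hash references: word => corpus count, and doc id => word => count.
sub count_words {
    my (@lines) = @_;
    my (%corpus, %per_doc);
    my $doc;             # id of the current document, undefined before the first .i
    my $ignoring = 1;    # flag: true while inside text that must not be indexed
    for my $line (@lines) {
        if ($line =~ /^\.i\s+(\d+)/) {     # new document: start indexing again
            $doc      = $1;
            $ignoring = 0;
        }
        elsif ($line =~ /^\.x\b/) {        # assumed ignored block: stop indexing
            $ignoring = 1;
        }
        elsif ($line =~ /^\./ or $ignoring) {
            next;                          # other marker line, or ignored text
        }
        else {
            for my $word (split ' ', $line) {
                $corpus{$word}++;          # count across the whole corpus
                $per_doc{$doc}{$word}++;   # count within the current document
            }
        }
    }
    return (\%corpus, \%per_doc);
}

# Made-up sample lines in the corpus format.
my @sample = ('.i 1', '.t', 'list structur', '.w', 'list process',
              '.x', '1024 4 1549',
              '.i 2', '.t', 'batch process', '.x', '1550 4 1550');
my ($corpus, $per_doc) = count_words(@sample);

for my $word (sort keys %$corpus) {
    print "$word: $corpus->{$word} in corpus\n";
    for my $id (sort { $a <=> $b } keys %$per_doc) {
        print "  doc $id: $per_doc->{$id}{$word}\n" if $per_doc->{$id}{$word};
    }
}
```

Starting with the flag set means any stray text before the first `.i` line is skipped, so the document id is always defined by the time a word is counted.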