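I am trying to build an inverted index of words and their locations within a given corpus of documents. An example of the data structure I am aiming for: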
+----------+--------------------------------------------------------------+
| Word     | Location                                                     |
+----------+--------------------------------------------------------------+
| 'word 1' | 'doc1' 'title', 'doc4' 'text', 'doc7' 'title' 'text'         |
+----------+--------------------------------------------------------------+
where 'title' and 'text' are the possible locations.
My code for parsing and building the data is:
while (my $line = <$fh>) {
# determine doc no and location within docs
....
#iterate words in a given location within a document
foreach my $str ($line =~ /[[:alpha:]]+/g) {
push @{ $doc{$docno} }, $location;
push @{ $wordlist{$str} }, $doc{$docno};
}
}
My code for printing the data is:
foreach my $str (reverse sort { $wordlist{$a} <=> $wordlist{$b} } keys %wordlist) {
printf $fo "%-15s %-15s \n", $str, "@{ $wordlist{$str} }";
}
However, the result is:
+----------+--------------------------------------------------------------+
| Word     | Location                                                     |
+----------+--------------------------------------------------------------+
| 'word1'  | ARRAY(0x66d4508) ARRAY(0x66d4508) ARRAY(0x66d4508)           |
+----------+--------------------------------------------------------------+
Where am I going wrong?
Edit:
I tried changing the printing code to:
foreach my $str (reverse sort { $wordlist{$a} <=> $wordlist{$b} } keys %wordlist) {
printf "%-15s", $str;
@arr = @{ $wordlist{$str} };
foreach $arr (@arr)
{
print "@{ $arr }: , ";
}
print "\n";
}
But the result is:
word101 title title text text text text text text ...
I don't know how to print the document number next to the locations within that document.
Answer 0 (score: 0)
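Your data structure throws away the information you need: $wordlist{$str} only holds references to the per-document location arrays, and nothing in it records which document number each array belongs to.
Simply do the following instead: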
while (my $line = <$fh>) {
# determine doc no and location within docs
....
#iterate words in a given location within a document
foreach my $str ($line =~ /[[:alpha:]]+/g) {
# postfix dereference (->@*) needs Perl 5.24+; on older perls write @{ $wordlist{$str} }
push $wordlist{$str}->@*, {
docno => $docno,
location => $location
};
}
}
This makes printing the data structure trivial.
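As a rough sketch of what that printing could look like: the sample %wordlist below is made-up data in the shape built above, and format_locations is a hypothetical helper name, not something from the original code. It groups each word's entries by document number, keeps the documents in first-seen order, and skips repeated locations.

```perl
use strict;
use warnings;

# Hypothetical sample data in the shape the loop above builds:
# word => [ { docno => ..., location => ... }, ... ]
my %wordlist = (
    apple => [
        { docno => 'doc1', location => 'title' },
        { docno => 'doc4', location => 'text'  },
        { docno => 'doc7', location => 'title' },
        { docno => 'doc7', location => 'text'  },
    ],
    banana => [
        { docno => 'doc2', location => 'text' },
    ],
);

# Hypothetical helper: turn one word's entries into a string such as
# "'doc1' 'title', 'doc7' 'title' 'text'".
sub format_locations {
    my ($entries) = @_;
    my (@order, %by_doc);
    for my $entry (@$entries) {
        my ($docno, $location) = @{$entry}{qw(docno location)};
        # Remember the order in which documents were first seen.
        push @order, $docno unless exists $by_doc{$docno};
        # Record each location only once per document.
        push @{ $by_doc{$docno} }, $location
            unless grep { $_ eq $location } @{ $by_doc{$docno} || [] };
    }
    my @parts;
    for my $docno (@order) {
        push @parts, join ' ', map { "'$_'" } $docno, @{ $by_doc{$docno} };
    }
    return join ', ', @parts;
}

# Sort by the word itself; the hash values are array references,
# so comparing them with <=> would only compare memory addresses.
for my $word (sort keys %wordlist) {
    printf "%-15s %s\n", "'$word'", format_locations($wordlist{$word});
}
```

Sorting on the keys rather than on the values avoids the numeric comparison of references used in the question's printing loop. If you need a different order (for example by number of occurrences), sort on scalar @{ $wordlist{$a} } instead.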