Question

My goal is to build an inverted index file in Perl. My source file has more than 10 million lines of the form:

document id:  citing document 1; citing document 2;

Example:

document 56: document 12, document 45
document 117: document 12, document 22, document 99

I would like to create another file in the following form:

document 12: document 117, document 56 
...

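Currently I read the source file line by line and, for each citation, append to the index file (one line per document). But appending to the index file for every citation (In Perl, how do I change, delete, or insert a line in a file, or append to the beginning of a file?) is very slow. Is there an alternative or more efficient approach? Thanks.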

Answer 1

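Instead of modifying the index file in place, use the following algorithm:

Load the inverted index file into a hash structure.
Read each document and add its citations to the hash structure.
Write out the inverted index file.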
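
The three steps above could be sketched roughly as follows. This is only a minimal sketch, assuming the whole index fits in memory and that lines look like the examples in the question. The names `load_index`, `add_citations`, and `write_index` are hypothetical helpers made up for illustration, not part of any module.

```perl
use strict;
use warnings;

# Step 1 (hypothetical helper): load an existing inverted index, whose
# lines look like "document 12: document 117, document 56", into a hash
# of array references keyed by the cited document.
sub load_index {
    my ($fh) = @_;
    my %index;
    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;                      # drop newline and trailing spaces
        my ($cited, $citing) = split /:\s*/, $line, 2;
        next unless defined $citing;
        $index{$cited} = [ grep { length } split /\s*[,;]\s*/, $citing ];
    }
    return \%index;
}

# Step 2 (hypothetical helper): read "citing doc: cited doc, cited doc"
# lines and record the citing document under every document it cites.
sub add_citations {
    my ($fh, $index) = @_;
    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;
        my ($doc, $cites) = split /:\s*/, $line, 2;
        next unless defined $cites;
        for my $cited (grep { length } split /\s*[,;]\s*/, $cites) {
            push @{ $index->{$cited} }, $doc;
        }
    }
    return $index;
}

# Step 3 (hypothetical helper): write one line per cited document.
# Both sorts are plain string sorts.
sub write_index {
    my ($fh, $index) = @_;
    for my $cited (sort keys %$index) {
        print {$fh} "$cited: ", join(', ', sort @{ $index->{$cited} }), "\n";
    }
}

# Demo on in-memory data; real code would open the actual files instead.
my $source = "document 56: document 12, document 45\n"
           . "document 117: document 12, document 22, document 99\n";
open my $in, '<', \$source or die "Cannot open in-memory source: $!";
my $index = add_citations($in, {});
close $in;
write_index(\*STDOUT, $index);
```

Because the output file is written once at the end, the per-citation append that made the original approach slow goes away. The trade-off is that the hash has to hold every citation at once.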

Answer 2

You want to read in the file and build a hash from the data. This should get you started:

use strict;
use warnings;
use 5.010;

my %cited; # results go here

while (<DATA>) { # really read from your file
    chomp;
    my ($doc, @cites) = split /:\s+|,\s+/;
    for (@cites) {
        push @{$cited{$_}}, $doc;
    }
}
for (sort keys %cited) {
    say "$_ cited in: ", join ", ", sort @{$cited{$_}};
}

__DATA__
document 56: document 12, document 45
document 117: document 12, document 22, document 99
document 17: document 67, document 22, document 1

Generating an inverted index for a large dataset with Perl
