Question

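I have three files (each with two tab-separated fields, and no redundancy between the files). I want to read them in parallel and store their contents in a single hash.

This is what I tried: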

use warnings;
use strict;
use Parallel::ForkManager;
use Data::Dumper;

my @files = ('aa', 'ab', 'ac');

my %content;
my $max_processors = 3;
my $pm = Parallel::ForkManager->new($max_processors);

foreach my $file (@files) {
    $pm->start and next;

    open FH, $file or die $!;
    while(<FH>){
        chomp;
        my($field1, $field2) = split/\t/,$_;
        $content{$field1} = $field2;
    }
    close FH;

    $pm->finish;
}
$pm->wait_all_children;

print Dumper \%content;

The output of this script is:

$VAR1 = {};

I can see that the three files are processed in parallel, but... how can I store the contents of the three files for later processing?

Answer 1

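When you fork, each child process gets its own separate copy of memory, so the parent cannot see the data the child has read. You would have to find a way for the child to pass the data back, for example through a pipe, but at that point you might as well skip the fork and simply read the data directly.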
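To make the pipe idea concrete, here is a minimal sketch using only core Perl (`fork` and `pipe`), not Parallel::ForkManager. The sample data written to a temporary directory is hypothetical and only stands in for the question's 'aa', 'ab' and 'ac' files; the variable names are likewise assumptions made for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical sample data standing in for the question's three files.
my $dir    = tempdir(CLEANUP => 1);
my %sample = (aa => "apple\tpear\n", ab => "Joe\tWilson\n", ac => "New York\tMets\n");
my @files;
for my $name (sort keys %sample) {
    my $path = "$dir/$name";
    open my $out, '>', $path or die "Cannot write $path: $!";
    print {$out} $sample{$name};
    close $out;
    push @files, $path;
}

my %content;
my @children;    # one { pid, fh } record per forked child

for my $file (@files) {
    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: read the file and send every line back through the pipe.
        close $reader;
        open my $fh, '<', $file or die "Cannot open $file: $!";
        print {$writer} $_ while <$fh>;
        close $fh;
        close $writer;
        exit 0;
    }

    # Parent: keep only the read end of this child's pipe.
    close $writer;
    push @children, { pid => $pid, fh => $reader };
}

# Parent: drain each pipe in turn and fill the single hash.
for my $child (@children) {
    my $fh = $child->{fh};
    while (my $line = <$fh>) {
        chomp $line;
        my ($field1, $field2) = split /\t/, $line, 2;
        $content{$field1} = $field2;
    }
    close $fh;
    waitpid $child->{pid}, 0;
}

print "$_ => $content{$_}\n" for sort keys %content;
```

The parent still has to read and parse every line that comes through the pipes, which is why forking buys little here unless the children do substantial work of their own before sending results back.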

What you may want to look into instead is threads, since they share the same memory.
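A thread-based version might look roughly like the sketch below. It assumes a perl built with ithreads support (otherwise `use threads` fails at load time). Note that in Perl a variable is only visible across threads when it is explicitly marked as shared through `threads::shared`. The sample files and the `read_file` helper are hypothetical and exist only for illustration.

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use File::Temp qw(tempdir);

# Hypothetical sample data standing in for the question's three files.
my $dir    = tempdir(CLEANUP => 1);
my %sample = (aa => "apple\tpear\n", ab => "Joe\tWilson\n", ac => "New York\tMets\n");
my @files;
for my $name (sort keys %sample) {
    my $path = "$dir/$name";
    open my $out, '>', $path or die "Cannot write $path: $!";
    print {$out} $sample{$name};
    close $out;
    push @files, $path;
}

# Only variables marked :shared are visible to every thread.
my %content :shared;

# Hypothetical helper: parse one file and store its pairs in the shared hash.
sub read_file {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($field1, $field2) = split /\t/, $line, 2;
        lock(%content);    # serialize writes; released at end of this block
        $content{$field1} = $field2;
    }
    close $fh;
    return;
}

# One thread per file, then wait for all of them to finish.
my @workers = map { threads->create(\&read_file, $_) } @files;
$_->join for @workers;

print "$_ => $content{$_}\n" for sort keys %content;
```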

Answer 2

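You can do this with the run_on_finish() callback, storing the data as a reference with, for example, the file name as the key (see the "Data structure retrieval" section of the documentation for an example).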

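So, if you turn your file-reading code into a subroutine that returns the data as a reference, and then use the callback, you might end up with something like this: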

#!/usr/bin/perl
use strict;
use warnings;
use autodie;

use Parallel::ForkManager;
use Data::Dump;

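sub proc_file {
    # Read a tab-separated file into a hash and return a reference to it.
    # The two-column layout is assumed from the OP's example.
    my $file = shift;    # reference to a scalar holding the file name
    open(my $fh, "<", $$file);    # autodie reports any failure
    # Limit the split to two fields so a stray tab cannot unbalance the pairs.
    my %content = map { chomp; split(/\t/, $_, 2) } <$fh>;
    close($fh);
    return \%content;
}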

my %content;
my @files = ('aa','ab','ac');

my $pm = Parallel::ForkManager->new(3);
$pm->run_on_finish(
    sub {
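        my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data_structure_reference) = @_;
        # A child that failed sends nothing back, so skip it.
        return unless defined $data_structure_reference;
        my $input_file = $data_structure_reference->{input};
        $content{$input_file} = $data_structure_reference->{result};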
    }
);

# For each file, fork a child. On finish, the child hands back a hash ref holding
# the file name and the parsed contents; the parent's run_on_finish callback
# receives it as $data_structure_reference.
for my $input_file (@files) {
    $pm->start and next;
    my $return_data = proc_file(\$input_file);

    $pm->finish(0,
        {
          result  => $return_data,
          input   => $input_file,
        }
     );
}
$pm->wait_all_children;

dd \%content;

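This gives you a hash of hashes, with the file name as the key and the file's contents as the sub-hash, which you can easily flatten, merge, or reshape however you like: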

$ ./parallel.pl a*
{
  aa => { apple => "pear" },
  ab => { Joe => "Wilson" },
  ac => { "New York" => "Mets" },
}
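If a single flat hash is what you are after, as in the question, the per-file sub-hashes can be merged afterwards. The following is a minimal sketch that starts from the structure shown in the output above; the `%merged` name is made up for illustration, and it relies on the question's statement that the files share no keys (with duplicate keys, later files would overwrite earlier ones).

```perl
use strict;
use warnings;

# The hash of hashes as produced above (values copied from the sample output).
my %content = (
    aa => { apple      => 'pear' },
    ab => { Joe        => 'Wilson' },
    ac => { 'New York' => 'Mets' },
);

# Merge every per-file sub-hash into one flat hash using a hash slice.
my %merged;
for my $file (sort keys %content) {
    my $pairs = $content{$file};
    @merged{ keys %$pairs } = values %$pairs;
}

print "$_ => $merged{$_}\n" for sort keys %merged;
```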

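Note that, as with any forking program, the associated overhead is fairly high, so this may not end up being any faster than simply looping over the files.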

Perl: Storing the contents of many files using Parallel::ForkManager
