Question

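I was wondering whether it would be a good idea to use Parallel::ForkManager (or another parallelization tool) to process some files I have. Basically, I am processing a very large file and writing its contents out to multiple files. This usually takes about 3 hours on a 64-core server.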

What I would like to know is how this module's implementation gathers the data. For instance, if I do

use Parallel::ForkManager;
# Max 30 processes
my $pm = new Parallel::ForkManager(64);

open my $in, "<", 'D:\myfile.txt' or die "Could not open input file: $!";
my @data=<$in>;
close $in;

#gathers unique dataheaders
my @uniqueheaders;
foreach my $line (@data){
  my @split=split "\t",$line;
  push @uniqueheaders,$split[0] unless (grep{$_=~/$split[0]} @uniqueheaders);
}

foreach my $head (@uniqueheaders) {
   $pm->start and next; # do the fork

   (my @matches) = grep{$_=~/^$head\t/} @data; #finds all matches in @data started by $head
   if (@matches) { #prints out if any matches are found
      open my $out, ">", 'D:\directory\\' . $head . 'data' or die "Could not open output file for $head: $!";
      print $out @matches;
      close $out;
   }
   else{ print "Problem in $head!\n";}

   $pm->finish; # do the exit in the child process
}
$pm->wait_all_children;

Now, my questions are:

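Do you see any problem with writing a script like this? Would each $head be assigned to one core at a time, or is there something else I need to watch out for that I am not aware of?
What if I wanted to process the whole data set and write the output once? For instance, creating an array @gatherstuff before the last foreach loop and, instead of print, doing push @gatherstuff,@matches;. Is it as simple as I am making it sound?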

Answer 1

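Using Parallel::ForkManager with a single input file may end up making sense only if you pre-process the file to determine the ranges to allocate to each worker. And that only makes sense if you are going to repeat the work multiple times with the same input.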
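As a rough illustration of what "determining ranges" could look like, here is a minimal sketch that splits a file into byte ranges aligned to line boundaries. The function name line_aligned_ranges and the choice of byte offsets as the unit are assumptions made for this example, not something the answer prescribes.

```perl
use strict;
use warnings;

# Split the file at $path into at most $n contiguous byte ranges [start, end),
# each starting at the beginning of a line. (Hypothetical helper.)
sub line_aligned_ranges {
    my ($path, $n) = @_;
    my $size = -s $path;
    return () unless $size;

    open my $fh, '<', $path or die "Could not open $path: $!";
    binmode $fh;

    my @starts = (0);
    for my $i (1 .. $n - 1) {
        my $guess = int($size * $i / $n);
        next if $guess <= $starts[-1];

        # Jump just before the guessed offset and read to the end of that line,
        # so the next range starts on a line boundary.
        seek $fh, $guess - 1, 0 or die "seek failed: $!";
        my $skipped = <$fh>;
        my $pos = tell $fh;
        push @starts, $pos if $pos > $starts[-1] && $pos < $size;
    }
    close $fh;

    my @ranges;
    for my $i (0 .. $#starts) {
        my $end = $i < $#starts ? $starts[$i + 1] : $size;
        push @ranges, [ $starts[$i], $end ];
    }
    return @ranges;
}
```

Each worker would then seek to its start offset and read lines until tell reaches its end offset. Note that for the task in the question, records with the same header can fall into different ranges, so workers would still have to coordinate their output files.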

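Even if you might gain something from using Parallel::ForkManager, having 30 processes all trying to do IO is not going to get you anything. The most I would recommend is twice the number of cores, provided the system is not doing anything else and you have plenty of memory.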

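The OS's caching may result in the different processes actually reading the file from memory after the initial warm-up, and that can lead to gains from having multiple processes do the processing.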

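Writing is less likely to benefit from multiple processes, for a number of reasons: the processes will be reading from all over the memory space, they will have to wait for buffers to be flushed to disk, and so on. In this case, the IO bottleneck will definitely be more prominent.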
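On the question of gathering results in the parent: after a fork, each child works on its own copy of memory, so a push @gatherstuff,@matches; inside a child is not visible to the parent. Parallel::ForkManager can pass a data structure back through finish and a run_on_finish callback. The following is a minimal sketch of that mechanism; the header names and the stand-in for the real grep are made up for illustration.

```perl
use strict;
use warnings;
use Parallel::ForkManager;

my @heads = qw(alpha beta gamma);    # hypothetical headers
my $pm    = Parallel::ForkManager->new(2);

my %gathered;
# Runs in the parent each time a child has been reaped.
$pm->run_on_finish(sub {
    my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data_ref) = @_;
    $gathered{$ident} = $data_ref if defined $data_ref;
});

for my $head (@heads) {
    $pm->start($head) and next;      # parent moves on; child falls through

    my @matches = map { "$head\t$_\n" } 1 .. 3;    # stand-in for the real work
    $pm->finish(0, \@matches);       # hand @matches back to the parent
}
$pm->wait_all_children;

printf "%s: %d records\n", $_, scalar @{ $gathered{$_} } for sort keys %gathered;
```

The returned structure is serialized and copied between processes, so sending back large arrays adds IO of its own.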

Answer 2

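Before trying to make the code run in parallel, see whether you can optimize it to run efficiently in serial. If the benefit from that optimization is not enough, you can then try Parallel::ForkManager. Some of the problems with your code are: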

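Reading the whole file into memory: reading such a large number of lines at once greatly increases the program's memory usage, and the execution time goes up as well. Memory may not be an issue, but the repeated reallocation of the @data array takes time. If the amount of RAM is small, you will end up doing a lot of swapping to disk, which costs even more time.
grep used in place of a 'contains' check: grepping over such a large number of records multiple times is very slow and does not scale at all. As it stands, extracting the headers is O(n^2), where n is the number of records in the input file. If you use a hash, it becomes O(n), which is much more manageable. A similar argument applies to the way you extract the matching records.
'Headers' extracted at the beginning: this may be necessary in your current approach to running the code in parallel, but you could try to avoid it, since it means a full pass over all the records.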
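To illustrate the hash-based 'contains' check, here is a minimal sketch of header extraction using a %seen hash. The function name unique_headers and the demo records are made up for this example. A side effect worth noting: the grep in the question matches headers as unanchored patterns, so a header that is a substring of an earlier one would be skipped, whereas the hash compares whole strings.

```perl
use strict;
use warnings;

# Return the distinct first tab-separated fields of the given lines,
# in the order they were first seen. (Hypothetical helper.)
sub unique_headers {
    my @lines = @_;
    my (%seen, @headers);
    for my $line (@lines) {
        my ($head) = split /\t/, $line, 2;
        next unless defined $head && length $head;
        # Constant-time hash lookup instead of a grep over @headers.
        push @headers, $head unless $seen{$head}++;
    }
    return @headers;
}

my @demo = ("a\t1\n", "b\t2\n", "a\t3\n", "ab\t4\n");
print join(',', unique_headers(@demo)), "\n";
```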

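Here is how I would solve it, without making the code run in parallel. You may need to increase the number of allowed open file descriptors with the ulimit -n command.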

use strict;
use warnings;

my ($input_file, $output_dir) = (@ARGV);

die "Syntax: $0 <input_file> <output_dir>"
    unless $input_file and $output_dir;

open my $in, '<', $input_file
    or die "Could not open input file $input_file: $!";

# map of ID (aka header) -> file handle
my %idfh;

while (my $line = <$in>) {
    # extract the ID; skip lines that have no tab-separated ID
    next unless $line =~ /^(.+?)\t/;

    my $id = $1;
    # get the open file handle
    my $fh = $idfh{$id};

    unless ($fh) {
        # if there was no file handle for this ID, open a new one
        open $fh, '>', "$output_dir/${id}data"
            or die "Could not open file for ID $id: $!";

        $idfh{$id} = $fh;
    }

    # print the record to the correct file handle
    print $fh $line;
}

# perl automatically closes all file handles

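It is quite simple:

Iterate over each line of the file. For each iteration, do the following:
1. Extract the ID.
2. If we have not seen the ID before, open the file corresponding to the ID for writing. Otherwise, go to step 4.
3. Store the file handle in a map with the ID as the key.
4. If the ID has been seen before, fetch the file handle from the hash.
5. Write the record through the file handle.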
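If raising the descriptor limit with ulimit -n is not an option, one possible variation is to cap the number of handles kept open and reopen files in append mode when they are needed again. This is only a sketch under that assumption; the names handle_for, $MAX_OPEN and %opened_before are invented for this example, and the cap of 100 is arbitrary.

```perl
use strict;
use warnings;

my $MAX_OPEN = 100;    # assumed cap, chosen to stay below `ulimit -n`
my (%idfh, %opened_before);

# Return a write handle for $id under $output_dir. When the cap is reached,
# all cached handles are closed; a file that was written earlier is then
# reopened in append mode so its existing records are kept.
sub handle_for {
    my ($output_dir, $id) = @_;
    return $idfh{$id} if $idfh{$id};

    if (scalar(keys %idfh) >= $MAX_OPEN) {
        close $_ or die "Could not close output file: $!" for values %idfh;
        %idfh = ();
    }

    my $mode = $opened_before{$id}++ ? '>>' : '>';
    open my $fh, $mode, "$output_dir/${id}data"
        or die "Could not open file for ID $id: $!";
    return $idfh{$id} = $fh;
}
```

Inside the read loop, the lookup and open would be replaced by my $fh = handle_for($output_dir, $id); followed by print $fh $line;. Closing every handle at once is the simplest eviction policy; if IDs are clustered in the input, it should rarely trigger.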

Processing a file with Parallel::ForkManager
