Question

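I have a Perl program that converts a batch of files from one format to another (via a command-line tool). It works, but it is too slow because it converts the files one at a time.

I did some research and used fork() to spawn each conversion as a child process, hoping to make use of all the CPUs/cores.

The code is written and tested, and it does improve performance, but not as much as I expected. Looking at /proc/cpuinfo, I see this: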

> egrep -e "core id" -e ^physical /proc/cpuinfo|xargs -l2 echo|sort -u
physical id : 0 core id : 0
physical id : 0 core id : 1
physical id : 0 core id : 2
physical id : 0 core id : 3
physical id : 1 core id : 0
physical id : 1 core id : 1
physical id : 1 core id : 2
physical id : 1 core id : 3

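Does this mean I have 2 CPUs with four cores each? If so, I should be able to fork 8 children, and 8 minutes of work (8 files at 1 minute per file) should finish in about 1 minute (8 children, one file each).

However, when I run the test, it still takes 4 minutes to finish. It looks as if only the 2 CPUs are being used, not the cores?

So my questions are:

Does Perl's fork() parallelize only across CPUs and not across cores? Or am I doing something wrong? I am just using fork() and wait(), nothing special.
I assume Perl's fork() should use the cores. Is there a simple bash/Perl script I could write to show whether my OS (RedHat 4) or Perl is the culprit for this symptom?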

Added:

I even tried running the following command several times to simulate multiple processes, and watched htop.

while true; do echo abc >>devnull; done &

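Somehow htop tells me I have 16 cores? When I start 4 of the while loops above, I see 4 of them each using ~100% CPU. When I start more, they all begin to drop evenly in CPU utilization (e.g. with 8 processes I see 8 bash entries in htop, each using about 50%). Does this mean anything?

Thanks a lot. I tried Googling but could not find an obvious answer.

Edit: 2016-11-09

Below is an excerpt of the Perl code. I would like to know what I am doing wrong here.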

my $maxForks = 50;
my $forks = 0;
while(<CIFLIST>) {
    extractPDFByCIF($cifNumFromIndex, $acctTypeFromIndex, $startDate, $endDate);
}
for (1 .. $forks) {
    my $pid = wait();
    print "Child fork exited.  PID=$pid\n";
}

sub extractPDFByCIF {
    # doing SQL constructing to for the $stmt to do a DB query
    $stmt->execute();

    while ($stmt->fetch()) {
        # fork the copy/afp2web process into child process
        if ($forks >= $maxForks) {
            my $pid = wait();
            print "PARENTFORK: Child fork exited.  PID=$pid\n";
            $forks--;
        }
        my $pid = fork;
        if (not defined $pid) {
            warn "PARENTFORK: Could not fork.  Do it sequentially with parent thread\n";
        }
        if ($pid) {
            $forks++;
            print "PARENTFORK: Spawned child fork number $forks. PID=$pid\n";
        }else {
            print "CHILDFORK: Processing child fork. PID=$$\n";
            # prevent child fork to destroy dbh from parent thread
            $dbh->{InactiveDestroy} = 1;
            undef $dbh;

            # perform the conversion as usual
            if($fileName =~ m/.afp/){
                    system("file-conversion -parameter-list");
            } elsif($fileName =~ m/.pdf/) {
                    system("cp $from-file $to-file");
            } else {
                    print ERRORLOG "Problem happened here\r\n";
            }
            exit;
        }
        # end forking
    }

    $stmt->finish();
    close(INDEX);
}

Answer 1

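fork() spawns a new process, identical to the existing one and with the same state. No more, no less. The kernel schedules it and runs it on whichever core it chooses.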
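If you want to see the scheduling for yourself, here is a minimal sketch. It assumes a purely CPU-bound stand-in workload; the names burn_cpu and time_forked_workers are made up for illustration and are not part of your program. If the elapsed time stays roughly flat as the number of workers grows towards the number of cores, the forked children are being spread across cores.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Unbuffer STDOUT so pending output is not duplicated into forked children.
$| = 1;

# Hypothetical CPU-bound workload: a tight arithmetic loop with no IO.
sub burn_cpu {
    my ($iterations) = @_;
    my $x = 0;
    $x += $_ % 7 for 1 .. $iterations;
    return $x;
}

# Fork $children workers, each running burn_cpu, and return the elapsed
# wall-clock seconds until all of them have exited.
sub time_forked_workers {
    my ( $children, $iterations ) = @_;
    my $start = time;
    my @pids;
    for ( 1 .. $children ) {
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        if ( $pid == 0 ) {    # child: do the work, then leave
            burn_cpu($iterations);
            exit 0;
        }
        push @pids, $pid;     # parent: remember the child
    }
    waitpid( $_, 0 ) for @pids;
    return time - $start;
}

# Same amount of work per worker, with an increasing number of workers.
for my $n ( 1, 2, 4, 8 ) {
    printf "%d worker(s): %.2f s\n", $n, time_forked_workers( $n, 1_000_000 );
}
```

Because this loop touches no files, any difference between it and your real job points at something other than fork() itself.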

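If you are not getting the results you expect, I would suggest that the more likely limiting factor is that you are reading files from your disk subsystem. Disks are slow, and contending for IO does not actually make them any faster. If anything it does the opposite, because it forces extra drive seeks and makes caching less effective.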
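One rough way to check this is to compare the wall-clock time of a command with the CPU time charged to it. The sketch below is only an illustration: profile_command is a made-up helper, and 'sleep 1' is a placeholder for your real conversion command. A wall-clock time far above the CPU time suggests the process spends most of its life waiting (for example on disk) rather than computing.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Run a command and return (wall-clock seconds, child CPU seconds).
# The built-in times() returns (user, system, child user, child system);
# the child values cover children that have terminated and been waited for.
sub profile_command {
    my (@cmd) = @_;
    my @before = times;
    my $start  = time;
    system(@cmd);
    my $wall  = time - $start;
    my @after = times;
    my $cpu   = ( $after[2] - $before[2] ) + ( $after[3] - $before[3] );
    return ( $wall, $cpu );
}

# Placeholder command; substitute the real conversion invocation here.
my ( $wall, $cpu ) = profile_command( 'sleep', '1' );
printf "wall %.2f s, child CPU %.2f s\n", $wall, $cpu;
```

If the two numbers are close, the job is CPU-bound and more cores should help; if not, adding more forks mostly adds contention.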

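Specifically:

1/ No, fork() only ever clones your process.

2/ There is almost no point, unless you want to rewrite most of your algorithm as a shell script. There is no reason to think it would behave any differently.

To follow on from your edit:

system('file-conversion') looks very much like an IO-based process, which will be limited by your disk IO. The same goes for cp.
Have you considered Parallel::ForkManager, which greatly simplifies the forking part?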

As a lesser style point, you should use the 3-arg 'open'.
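For reference, a small sketch of what that looks like. The helper name read_lines and the file name in the usage comment are assumptions for illustration, since the excerpt does not show how CIFLIST is opened.

```perl
use strict;
use warnings;

# Read all lines from a file using three-arg open and a lexical filehandle.
# The mode ('<') is passed separately from the file name, so characters
# such as '>' or '|' in the name cannot change how the file is opened.
sub read_lines {
    my ($path) = @_;
    open( my $fh, '<', $path ) or die "Cannot open $path: $!";
    chomp( my @lines = <$fh> );
    close($fh);
    return @lines;
}

# Usage with a hypothetical list file name:
# my @cifs = read_lines('ciflist.txt');
```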

#!/usr/bin/env perl
use strict;
use warnings;
use Parallel::ForkManager;

my $maxForks = 50;

my $manager = Parallel::ForkManager->new($maxForks);

while ( <$ciflist> ) {

    ## do something with $_ to parse.

    ##instead of: extractPDFByCIF($cifNumFromIndex, $acctTypeFromIndex, $startDate, $endDate);

    # doing SQL constructing to for the $stmt to do a DB query
    $stmt->execute();

    while ( $stmt->fetch() ) {

        # fork the copy/afp2web process into child process
        $manager->start and next;
        print "CHILDFORK: Processing child fork. PID=$$\n";

        # prevent child fork to destroy dbh from parent thread
        $dbh->{InactiveDestroy} = 1;
        undef $dbh;

        # perform the conversion as usual
        if ( $fileName =~ m/.afp/ ) {
            system("file-conversion -parameter-list");
        } elsif ( $fileName =~ m/.pdf/ ) {
            system("cp $from-file $to-file");
        } else {
            print ERRORLOG "Problem happened here\r\n";
        }

        # end forking
        $manager->finish;
    }
    $stmt->finish();

}

$manager->wait_all_children;

Answer 2

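Your goal is to parallelize your application so that it uses multiple cores as independent resources. What you want to achieve is multi-threading, in particular Perl's ithreads ~~which use calls to the fork() function of the underlying system (and are heavyweight for that reason)~~. You can teach yourself the Perl way of multi-threading from perlthrtut. Quote from perlthrtut:

When a new Perl thread is created, all the data associated with the current thread is copied to the new thread, and is subsequently private to that new thread! This is similar in feel to what happens when a Unix process forks, except that in this case, the data is just copied to a different part of memory within the same process rather than a real fork taking place.

That being said, regarding your questions: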

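~~You are not doing it right (sorry).~~ [see my comment...] With multi-threading you do not need to call fork() yourself ~~, but Perl will do it for you~~.
You can check whether your Perl interpreter has been built with thread support, e.g. by running perl -V (note the capital V) and looking at the output. If nothing about threads shows up there, your Perl interpreter is not capable of Perl multi-threading.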
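The same check can be done from inside a script through the core Config module. This is a minimal sketch; has_ithreads is a made-up helper name.

```perl
use strict;
use warnings;
use Config;

# $Config{useithreads} is 'define' when the interpreter was built with
# ithreads support, and undefined or empty otherwise.
sub has_ithreads {
    my $value = $Config{useithreads};
    return ( defined($value) && $value eq 'define' ) ? 1 : 0;
}

my $message = has_ithreads()
    ? "This perl was built with ithreads support\n"
    : "This perl was built without ithreads support\n";
print $message;
```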

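The reason your application is already faster with fork(), even if only one CPU core is used, is probably that while one process has to wait for a slow resource such as the file system, another process can use the same core as a compute resource in the meantime.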
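A small sketch of that overlap, under the assumption that sleeping is an acceptable stand-in for waiting on a slow resource. The name time_waiting_children is hypothetical. Four children that each wait half a second should finish in roughly half a second overall rather than two seconds, regardless of how many cores are available.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Unbuffer STDOUT so pending output is not duplicated into forked children.
$| = 1;

# Fork $children processes that each just wait for $seconds (a stand-in
# for blocking on a slow resource) and return the total wall-clock time.
sub time_waiting_children {
    my ( $children, $seconds ) = @_;
    my $start = time;
    my @pids;
    for ( 1 .. $children ) {
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        if ( $pid == 0 ) {    # child: wait, then leave
            sleep $seconds;
            exit 0;
        }
        push @pids, $pid;     # parent: remember the child
    }
    waitpid( $_, 0 ) for @pids;
    return time - $start;
}

printf "4 waiting children took %.2f s in total\n",
    time_waiting_children( 4, 0.5 );
```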

perl fork() does not seem to use the cores, only the CPUs