Question

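I am trying to accomplish the following:

One thread reads data from a very large file (say, 10 GB) and pushes it into a queue. (I do not want the queue to grow very large.)
While the buildQueue thread is pushing data into the queue, about 5 worker threads should dequeue and process the data at the same time.

I have made an attempt, but because of the continuous loop in my buildQueue thread, my other threads are never reached.

My approach may be completely wrong. Any help is much appreciated.

Here is the code for buildQueue: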

sub buildQueue {
    print "Enter a file name: ";
    my $dict_path = <STDIN>;
    chomp($dict_path);
    open DICT_FILE, $dict_path or die("Sorry, could not open file!");
    while (1) {
        if (<DICT_FILE>) {
            if ($queue->pending() < 100) {
                 my $query = <DICT_FILE>;
                 chomp($query);
                 $queue->enqueue($query);
                 my $count = $queue->pending();
                 print "Queue Size: $count Query: $query\n";
            }
        }
    }
}

As I expected, once this thread is executed nothing after it gets executed, because the thread never finishes.

my $builder = new Thread(&buildQueue);

Since the builder thread runs for a long time, I never get to create the worker threads.

Here is the entire code:

#!/usr/bin/perl -w
use strict;
use Thread;
use Thread::Queue;


my $queue = new Thread::Queue();
my @threads;

sub buildQueue {
    print "Enter a file name: ";
    my $dict_path = <STDIN>;
    chomp($dict_path);
    open dict_file, $dict_path or die("Sorry, could not open file!");
    while (1) {
        if (<dict_file>) {
            if ($queue->pending() < 100) {
                 my $query = <dict_file>;
                 chomp($query);
                 $queue->enqueue($query);
                 my $count = $queue->pending();
                 print "Queue Size: $count Query: $query\n";
            }
        }
    }
}

sub processor {
    my $query;
    while (1) {
        if ($query = $queue->dequeue) {
            print "$query\n";
        }
    }
}

my $builder = new Thread(&buildQueue);
push @threads, new Thread(&processor) for 1..5;

Answer 1

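You need to mark when you want your threads to exit (via either join or detach). The fact that you have infinite loops with no last statement to break out of them is also a problem.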
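One more detail that may explain why the workers are never created: in Perl, `&buildQueue` calls the sub on the spot, whereas `\&buildQueue` produces a code reference that a thread constructor can run later. The sketch below shows the difference without any threads; `build_queue` and `spawn` are hypothetical stand-ins, not part of the question's code or of the threads API.

```perl
use strict;
use warnings;

my @log;

# Hypothetical stand-in for the question's buildQueue sub.
sub build_queue {
    push @log, 'build_queue ran';
    return 'result';
}

# Hypothetical stand-in for a thread constructor: it only reports
# what kind of argument it was handed.
sub spawn {
    my ($arg) = @_;
    return ref($arg) eq 'CODE' ? 'code reference' : "plain value: $arg";
}

# &build_queue runs the sub immediately and passes its return value on.
my $wrong = spawn(&build_queue);

# \&build_queue passes a reference; nothing runs until it is invoked.
my $right = spawn(\&build_queue);

print "$wrong\n$right\n";
print 'build_queue was called ', scalar(@log), " time(s)\n";
```

With an endless loop inside the sub, the first form never returns, so the statements after it (including the ones that create the workers) are never reached.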

Edit: I also forgot a very important part! Each worker thread will block, waiting for another item to process off of the queue, until it gets an undef from the queue. That is why we specifically enqueue one undef per worker thread after the queue builder is done.

Try:

#!/usr/bin/perl -w
use strict;
use threads;
use Thread::Queue;


my $queue = new Thread::Queue();
our @threads; #Do you really need our instead of my?

sub buildQueue
{
    print "Enter a file name: ";
    my $dict_path = <STDIN>;
    chomp($dict_path);

    #Three-argument open, please!
    open my $dict_file, "<",$dict_path or die("Sorry, could not open file!");
    while(my $query=<$dict_file>)
    {
        chomp($query);
        while(1)
        {   #Wait to see if our queue has < 100 items...
            if ($queue->pending() < 100) 
            {
                $queue->enqueue($query);
                print "Queue Size: " . $queue->pending . "\n";
                last; #This breaks out of the infinite loop
            }
        }
    }
    close($dict_file);
    foreach(1..5)
    {
        $queue->enqueue(undef);
    }
}

sub processor 
{
    my $query;
    while (defined($query = $queue->dequeue)) #defined() so an empty line or "0" does not stop the worker
    {
        print "Thread " . threads->tid . " got $query\n";
    }
}

my $builder=threads->create(\&buildQueue);
push @threads,threads->create(\&processor) for 1..5;

#Waiting for our threads to finish.
$builder->join;
foreach(@threads)
{
    $_->join;
}
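
As a possible alternative to polling pending() in a loop, newer releases of Thread::Queue can enforce the size limit themselves. The sketch below assumes Thread::Queue 3.07 or later (for the `limit` attribute, and `end`, which appeared in 3.01) and a perl built with thread support; `process_file` is a hypothetical helper name, and the per-line work is only a counter.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;    # limit assumes version 3.07+, end assumes 3.01+

# Hypothetical helper: stream a file through a bounded queue to a pool of
# workers and return how many lines the workers handled in total.
sub process_file {
    my ($path, $worker_count) = @_;

    my $queue = Thread::Queue->new;
    $queue->limit = 100;    # enqueue() blocks while 100 items are pending

    # Start the workers first so they are ready to drain the queue.
    my @workers;
    for (1 .. $worker_count) {
        push @workers, threads->create(sub {
            my $handled = 0;
            # defined() keeps lines such as "0" or "" from ending the loop
            while (defined(my $line = $queue->dequeue)) {
                $handled++;    # placeholder for the real per-line work
            }
            return $handled;
        });
    }

    open my $fh, '<', $path or die "Could not open $path: $!";
    while (my $line = <$fh>) {
        chomp $line;
        $queue->enqueue($line);
    }
    close $fh;

    # After end(), dequeue() returns undef once the queue is empty,
    # so no per-worker undef sentinels are needed.
    $queue->end;

    my $total = 0;
    for my $worker (@workers) {
        my ($handled) = $worker->join;
        $total += $handled;
    }
    return $total;
}

print process_file($ARGV[0], 5), " lines processed\n" if @ARGV;
```

Because enqueue() blocks when the queue is full, the reader thread sleeps instead of spinning, which should keep CPU usage down on a 10 GB input.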

Answer 2

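Perl's MCE module loves big files. With MCE, one can chunk many lines at once, slurp a big chunk as a scalar string, or read one line at a time. Chunking many lines at once reduces the IPC overhead.

MCE 1.504 is now out. It provides MCE::Queue with support for child processes, including threads. In addition, the 1.5 release comes with 5 models (MCE::Flow, MCE::Grep, MCE::Loop, MCE::Map, and MCE::Stream) which take care of instantiating the MCE instance as well as auto-tuning max_workers and chunk_size. One may override these options, by the way.

Below, MCE::Loop is used for the demonstration.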

use MCE::Loop;

print "Enter a file name: ";
my $dict_path = <STDIN>;
chomp($dict_path);

mce_loop_f {
   my ($mce, $chunk_ref, $chunk_id) = @_;

   foreach my $line ( @$chunk_ref ) {
      chomp $line;
      ## add your code here to process $line
   }

} $dict_path;

If you want to specify the number of workers and/or chunk_size, there are two ways to do it.

use MCE::Loop max_workers => 5, chunk_size => 300000;

Or...

use MCE::Loop;

MCE::Loop::init {
   max_workers => 5,
   chunk_size  => 300000
};

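Although chunking is preferred for large files, one can compare the time against chunking one line at a time. One may omit the first line inside the block (commented out). Notice that there is no need for an inner for loop. $chunk_ref is still an array reference containing 1 line. The input scalar $_ contains the line when chunk_size equals 1; otherwise it points to $chunk_ref.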

use MCE::Loop;

MCE::Loop::init {
   max_workers => 5,
   chunk_size  => 1
};

print "Enter a file name: ";
my $dict_path = <STDIN>;
chomp($dict_path);

mce_loop_f {
 # my ($mce, $chunk_ref, $chunk_id) = @_;

   my $line = $_;
   ## add your code here to process $line or $_

} $dict_path;

I hope this demonstration is helpful for folks wanting to process a file in parallel.

:) mario

Answer 3

This sounds like a case for the Parallel::ForkManager module.

Answer 4

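Another approach: you can also use user_tasks in MCE 1.2+ and create two multi-worker, multithreaded tasks: one task for reading (since it is a big file, you could also benefit from parallel reads that preserve the file read position) and one task for processing, etc.

The code below still uses Thread::Queue to manage your buffer queue.

The buildQueue sub has your queue size control, and it pushes the data directly to the manager process's $R_QUEUE; since we use threads, it has access to the parent's memory space. If you want to use forks instead, you can still access the queue through a callback function. Here I chose to simply push to the queue.

The processQueue sub simply dequeues whatever is in the queue until there is nothing left.

The task_end sub of each task is run only once, by the manager process, at the end of that task, so we use it to signal a stop to our worker processes.

Obviously, there is a lot of freedom in how you chunk your data for the workers, so you can decide on the size of each chunk or even how to slurp your data into it.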

#!/usr/bin/env perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;
use MCE;

my $R_QUEUE = Thread::Queue->new;
my $queue_workers = 8;
my $process_workers = 8;
my $chunk_size = 1;

print "Enter a file name: ";
my $input_file = <STDIN>;
chomp($input_file);

sub buildQueue {
    my ($self, $chunk_ref, $chunk_id) = @_;
    # Wait for room rather than silently dropping the chunk when the queue is full.
    threads->yield while $R_QUEUE->pending() >= 100;
    $R_QUEUE->enqueue($chunk_ref);
    $self->sendto('stdout', "Queue Size: " . $R_QUEUE->pending ."\n");
}

sub processQueue {
    my $self = shift;
    my $wid = $self->wid;
    while (my $buff = $R_QUEUE->dequeue) {
        $self->sendto('stdout', "Thread " . $wid . " got $$buff");
    }
}

my $mce = MCE->new(
    input_data => $input_file, # this could be a filepath or a file handle or even a scalar to treat like a file, check the documentation for more details.
    chunk_size => $chunk_size,
    use_slurpio => 1,

    user_tasks => [
        { # queueing task
            max_workers => $queue_workers,
            user_func => \&buildQueue,
            use_threads => 1, # we'll use threads to have access to the parent's variables in shared memory.
            task_end => sub { $R_QUEUE->enqueue( (undef) x $process_workers ) } # signal stop to our process workers when they hit the end of the queue. Thanks > Jack Maney!
        },
        { # process task
            max_workers => $process_workers,
            user_func => \&processQueue,
            use_threads => 1, # we'll use threads to have access to the parent's variables in shared memory
            task_end => sub { print "Finished processing!\n"; }
        }
    ]
);

$mce->run();

exit;

Perl queues and threads
