
Time: 2014-11-26 10:00:38

Tags: performance perl matrix hash

INPUT.txt. In practice I have up to 1,000 lines, each with 1 to 100 elements.

9 11  
3 4  
1 9  
5 12  
1 11  
5 11  
9 12  
10 5 8  
7 4 1
and so on...  
last: 1 2 3 4 5 6 7 . . .any number of elements (100 in my case).

matrix.txt (TAB DELIMITED)

1   1   1   1   1   1   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   1   1   1   
1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   
1   1   1   1   1   1   0   1   1   1   1   1   1   1   1   0   1   1   1   1   0   0   1   1   1   1   1   1   
1   1   1   1   1   1   0   1   1   1   1   1   1   1   0   1   1   0   1   1   1   1   0   1   0   0   1   1   
1   1   1   1   1   1   0   1   1   1   1   1   1   1   1   1   1   0   1   1   1   1   1   1   1   1   1   0   
1   0   1   1   1   1   0   1   1   1   1   0   1   1   0   1   1   0   1   1   1   1   0   1   0   1   1   1   
1   1   1   1   1   1   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   1   1   1   
1   0   1   1   1   1   0   1   1   1   1   0   1   1   0   0   1   0   1   1   1   1   1   1   0   0   1   1   
1   1   1   1   1   1   0   1   1   1   1   1   1   1   1   0   1   0   1   0   1   1   1   1   1   1   1   0   
and so on... up to 25,000 lines


1   1   1   1   1   0   1   1   1   2   2   2   2   2 ... columns up to the number of lines in INPUT.txt
1   1   1   1   1   1   1   1   1   2   2   2   2   2
1   0   0   1   1   1   1   1   1   2   2   2   2   2
1   1   1   0   1   0   0   1   1   2   2   2   2   2
1   1   1   1   1   1   1   1   0   2   2   2   2   2
1   1   1   0   1   0   1   1   1   1   2   2   2   2
1   1   1   1   1   0   1   1   1   2   2   2   2   2
1   1   1   1   1   0   0   1   1   1   2   2   2   2
0   1   1   1   1   1   1   1   0   2   2   2   2   2


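use strict;
use warnings;
use List::Util 'sum';

# Each line of INPUT.txt lists 1-based column numbers; store them as 0-based indexes.
my @indexes = do {
    open my $fh, '<', "INPUT.txt" or die "INPUT.txt: $!";
    map { [map {$_ - 1} split ' '] } <$fh>
};
open my $infh, '<', "matrix.txt" or die "matrix.txt: $!";
open OUT, '>', "output.txt" or die "output.txt: $!";
while (<$infh>) {
    my @vals = split ' ';
    # One output column per index list: the sum of the selected matrix columns.
    print OUT join('    ', map {sum(@vals[@$_])} @indexes), "\n";
}
close OUT;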



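One thing to look at and try is the math- and matrix-oriented modules on CPAN. Some of them use native code (they are C-based Perl extensions), which should be faster. Here is their (outdated) primer: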


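For input sizes of (1 000 x 100) and (25 000 x 100), this version is about twice as fast as the original. Reading the whole matrix into memory and then processing it gives about the same run time, but might be faster with parallelism enabled. For a rough sense of how far the run time can be pushed, an optimized C version runs about 4 times faster than this one (8 times faster than the original). All timings are from my machine, but I would expect similar ratios on most computers. I also make no claim that my PDL is optimal, since I used this as an excuse to learn it.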

use strict;
use warnings;

use PDL;

my $indexes = PDL::long(do {
    open(my $fh, '<', 'INPUT.txt') or die;
    # The first map is if you allow duplicates in the index list (i.e. 2 2 is a valid row)
    # map { my $p = zeroes(100); $p->slice($_)++ foreach (map {$_ - 1} split /\t/); $p } <$fh>
    map { zeroes(100)->dice([map {$_ - 1} split /\t/])++ } <$fh>
})->xchg(0, 1);

open(my $input, '<', 'matrix.txt') or die;
open(my $output, '>', 'output.txt') or die;

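while(<$input>) {
    # Multiply the row vector by the 0/1 index matrix to get every column sum at once.
    my $vals = PDL::long(split(/\t/));
    print $output join("\t", ($vals x $indexes)->list) . "\n";
}
close($output);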

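The reason I ask: there is a sort of holy trinity of performance bottlenecks:

  • CPU - the actual operations executed on the processor.
  • 'Active' memory - the size of your memory footprint versus available memory, and how much you are paging.
  • IO - transferring data to and from disk.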
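
One rough way to tell which of these three dominates is to compare wall-clock time with CPU time for a phase of the program. The sketch below is only an illustration: `profile_phase` and the stand-in workload are hypothetical names, not part of the original script. If CPU time is close to wall time, the phase is probably CPU-bound; a large gap suggests it is waiting on IO or paging.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: run a code reference and report wall-clock
# time alongside CPU time (user + system) for that phase.
sub profile_phase {
    my ($label, $code) = @_;
    my $wall_start = time;
    my ($user_start, $sys_start) = times;
    $code->();
    my ($user_end, $sys_end) = times;
    my $wall = time - $wall_start;
    my $cpu  = ($user_end - $user_start) + ($sys_end - $sys_start);
    printf "%s: wall %.3fs, cpu %.3fs\n", $label, $wall, $cpu;
    return ($wall, $cpu);
}

# Stand-in workload (an assumption for illustration): a pure computation loop.
my ($wall, $cpu) = profile_phase('sum loop', sub {
    my $total = 0;
    $total += $_ for 1 .. 200_000;
});
```

Wrapping the read of INPUT.txt and the main `while` loop in separate calls would show where the time actually goes before choosing an optimization.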


Operations like map are the first thing I would look at: map / sort / grep are very powerful, but they can end up using less than ideal algorithms.
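
To check whether a particular construct is the slow part, the core Benchmark module can time alternatives side by side. This is a minimal sketch with made-up data; `slice_sum` and `loop_sum` are hypothetical names, and which variant is faster depends on the Perl build and on the data.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);
use List::Util qw(sum);

# Made-up sample data: one matrix row of 100 values and one index list.
my @vals  = map { $_ % 2 } 1 .. 100;
my @index = (0, 4, 9, 49);

# Variant 1: array slice plus List::Util::sum, as in the question.
sub slice_sum { return sum(@vals[@index]) }

# Variant 2: explicit accumulation loop.
sub loop_sum {
    my $total = 0;
    $total += $vals[$_] for @index;
    return $total;
}

# Time both variants over the same number of iterations.
timethese(500_000, {
    slice_sum => \&slice_sum,
    loop_sum  => \&loop_sum,
});
```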

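If you are CPU bound, you can try multithreading or forking to get more CPU time. On the face of it, your processing of 'matrix.txt' has no dependencies between lines (each line stands alone), so it may be a good candidate for parallelism.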

I am thinking of using Parallel::ForkManager to wrap the while loop. The downside is that the output order becomes non-deterministic, which needs to be dealt with.


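use strict;
use warnings;
use List::Util 'sum';
use Data::Dumper;
use Fcntl qw(:flock);

use Parallel::ForkManager;

# Run at most 10 child processes at a time.
my $mgr = Parallel::ForkManager->new(10);

my @indexes = do {
    open my $fh, '<', "INPUT.txt" or die "INPUT.txt: $!";
    map {
        [ map { $_ - 1 } split ' ' ]
    } <$fh>;
};
open my $infh,   '<', "matrix.txt" or die "matrix.txt: $!";
open my $out_fh, '>', "output.txt" or die "output.txt: $!";
while (<$infh>) {
    $mgr->start and next;    # parent: go on to the next line
    my @vals = split ' ';
    my $output_line = join( '    ', map { sum( @vals[@$_] ) } @indexes ) . "\n";
    flock( $out_fh, LOCK_EX );
    print {$out_fh} $output_line;
    $mgr->finish;            # child: exit once its line is written
}
$mgr->wait_all_children;
close $out_fh;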

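Note: this will work, but you will get a random output order, which is almost certainly not what you want. It will, however, use 10 processors at once for the "join/map/sum" operation.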



use warnings;
use strict;

use List::Util 'sum';

use threads;
use Thread::Queue;

my $line_q   = Thread::Queue->new();
my $output_q = Thread::Queue->new();

my @indexes = do {
    open my $fh, '<', "INPUT.txt" or die "INPUT.txt: $!";
    map {
        [ map { $_ - 1 } split ' ' ]
    } <$fh>;
};

# Worker: take "line_number values..." items and queue "line_number:sums".
sub generate_output {
    while ( my $item = $line_q->dequeue() ) {
        print "processing $item \n";
        my ( $line_num, @vals ) = split( ' ', $item );
        $output_q->enqueue( $line_num . ":"
                . join( '    ', map { sum( @vals[@$_] ) } @indexes )
                . "\n" );
    }
}

# Collector: hold results that arrive early and write them in line order.
sub coalesce_output {
    open my $out_fh, '>', "output.txt" or die "output.txt: $!";
    my $current_line = 1;    # $. numbers lines from 1
    my %lines;
    while ( my $item = $output_q->dequeue ) {
        my ( $line_num, $output_line ) = split( ":", $item, 2 );
        if ( $line_num == $current_line ) {
            print {$out_fh} $output_line;
            $current_line++;
        }
        else {
            $lines{$line_num} = $output_line;
        }
        while ( defined $lines{$current_line} ) {
            print {$out_fh} $lines{$current_line};
            delete $lines{$current_line};
            $current_line++;
        }
    }
    close $out_fh;
}

open my $infh, '<', "matrix.txt" or die "matrix.txt: $!";

my @workers;
for ( 1 .. 10 ) {
    push( @workers, threads->create( \&generate_output ) );
}

my $coalescer = threads->create( \&coalesce_output );

while ( my $line = <$infh> ) {
    $line_q->enqueue("$. $line");
}

# No more input: let the workers drain the queue and exit.
$line_q->end();
foreach my $thr (@workers) {
    $thr->join();
}

# All results are queued: let the collector finish writing.
$output_q->end();
$coalescer->join();



use warnings;
use strict;

use List::Util 'sum';

use threads;
use Thread::Queue;

my $line_q   = Thread::Queue->new();
my $output_q = Thread::Queue->new();

my @indexes = do {
    open my $fh, '<', "INPUT.txt" or die "INPUT.txt: $!";
    map {
        [ map { $_ - 1 } split ' ' ]
    } <$fh>;
};

# Worker: take "line_number values..." items and queue "line_number:sums".
sub generate_output {
    while ( my $item = $line_q->dequeue() ) {

        #print "processing $item \n";
        my ( $line_num, @vals ) = split( ' ', $item );
        $output_q->enqueue( $line_num . ":"
                . join( '    ', map { sum( @vals[@$_] ) } @indexes )
                . "\n" );
    }
}

# Collector: hold results that arrive early and write them in line order.
sub coalesce_output {
    open my $out_fh, '>', "output.txt" or die "output.txt: $!";
    my $current_line = 1;    # $. numbers lines from 1
    my %lines;
    while ( my $item = $output_q->dequeue ) {

        my ( $line_num, $output_line ) = split( ":", $item, 2 );

        #     print "Got $line_num ($current_line) $item\n";
        if ( $line_num == $current_line ) {

            #   print "printing $current_line = $output_line\n";
            print {$out_fh} $output_line;
            $current_line++;
        }
        else {
            $lines{$line_num} = $output_line;
        }
        while ( defined $lines{$current_line} ) {

            #   print "printing  (while) $current_line = $lines{$current_line}\n";
            print {$out_fh} $lines{$current_line};
            delete $lines{$current_line};
            $current_line++;
        }
    }
    close $out_fh;
}

open my $infh, '<', "matrix.txt" or die "matrix.txt: $!";

my @workers;
for ( 1 .. 40 ) {
    push( @workers, threads->create( \&generate_output ) );
}

threads->create( \&coalesce_output );

while ( my $line = <$infh> ) {
    $line_q->enqueue("$. $line");
}

# No more input: let the workers drain the queue and exit.
$line_q->end();
foreach my $thr (@workers) {
    $thr->join();
}

# All results are queued: let the collector finish, then join what is left.
$output_q->end();
foreach my $thr ( threads->list ) { $thr->join(); }


 1    1    1    1    1    1    1    1    1    1    1    1    2    2    2    2
 2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2
 2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2
 2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2
 2    2    2    2    2    2    2    2    2    2    2    2    2    2    3    3
 3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3
 3    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3

In the end, it depends on what your limiting factor is.


Started at 1417007048,
finished at 1417007064


Started at 1417007118
finished at 1417007161
