Question

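Edit: added the solution.

Hi, I currently have some code that works but is very slow.

It merges 2 CSV files line by line using a primary key. For example, if file 1 contains the line: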

"one,two,,four,42"

and file 2 contains the line:

"one,,three,,42"

where the primary key, at 0-indexed $position = 4, is 42;

then the sub merge_file($file1, $file2, $outputfile, $position);

will output a file with the line:

"one,two,three,four,42";

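Every primary key is unique within each file, and a key may exist in one file but not in the other (and vice versa).

Each file has about 1 million lines.

Going through each line of the first file, I use a hash to store the primary key, with the line number as the value. The line number is an index into an array, array[line number], which stores each line of the first file.

Then I go through each line of the second file and check whether its primary key is in the hash. If it is, I get the line from file1array, copy the columns I need from the first array into the second array, and concatenate the result onto the end of the output string. Then I delete the hash entry, and at the very end I dump the whole thing to a file. (I am using an SSD, so I want to minimise file writes.)

It is best explained with code: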

sub merge_file2{
 my ($file1,$file2,$out,$position) = ($_[0],$_[1],$_[2],$_[3]);
 print "merging: \n$file1 and \n$file2, to: \n$out\n";
 my $OUTSTRING = undef;

 my %line_for;
 my @file1array;
 open FILE1, "<$file1";
 print "$file1 opened\n";
 while (<FILE1>){
      chomp;
      $line_for{read_csv_string($_,$position)}=$.; #reads csv line at current position (of key)
      $file1array[$.] = $_; #store line in file1array.
 }
 close FILE1;
 print "$file2 opened - merging..\n";
 open FILE2, "<", $file2;
 my @from1to2 = qw( 2 4 8 17 18 19); #which columns from file 1 to be added into cols. of file 2.
 while (<FILE2>){
      print "$.\n" if ($.%1000) == 0;
      chomp;
      my @array1 = ();
      my @array2 = ();
      my @array2 = split /,/, $_; #split 2nd csv line by commas

      my @array1 = split /,/, $file1array[$line_for{$array2[$position]}];
      #                            ^         ^                  ^
      # prev line  lookup line in 1st file,lookup hash,     pos of key
      #my @output = &merge_string(\@array1,\@array2); #merge 2 csv strings (old fn.)

      foreach(@from1to2){
           $array2[$_] = $array1[$_];
      }
      my $outstring = join ",", @array2;
      $OUTSTRING.=$outstring."\n";
      delete $line_for{$array2[$position]};
 }
 close FILE2;
 print "adding rest of lines\n";
 foreach my $key (sort { $a <=> $b } keys %line_for){
      $OUTSTRING.= $file1array[$line_for{$key}]."\n";
 }

 print "writing file $out\n\n\n";
 write_line($out,$OUTSTRING);
}

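The first loop is fine, taking less than 1 minute, but the second loop takes about 1 hour to run, and I am wondering whether I have taken the right approach. I think a big speed-up should be possible? :) Thanks in advance.

Solution: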

sub merge_file3{
my ($file1,$file2,$out,$position,$hsize) = ($_[0],$_[1],$_[2],$_[3],$_[4]);
print "merging: \n$file1 and \n$file2, to: \n$out\n";
my $OUTSTRING = undef;
my $header;

my (@file1,@file2);
open FILE1, "<$file1" or die;
while (<FILE1>){
    if ($.==1){
        $header = $_;
        next;
    }
    print "$.\n" if ($.%100000) == 0;
    chomp;
    push @file1, [split ',', $_];
}
close FILE1;

open FILE2, "<$file2" or die;
while (<FILE2>){
    next if $.==1;
    print "$.\n" if ($.%100000) == 0;
    chomp;
    push @file2, [split ',', $_];
}
close FILE2;

print "sorting files\n";
my @sortedf1 = sort {$a->[$position] <=> $b->[$position]} @file1;
my @sortedf2 = sort {$a->[$position] <=> $b->[$position]} @file2;   
print "sorted\n";
@file1 = ();   # "@file1 = undef" would leave a one-element array holding undef
@file2 = ();
#foreach my $line (@file1){print "\t [ @$line ],\n";    }

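my ($i,$j) = (0,0);
my $n2 = @sortedf2;   # rows that came from file 2; rows pushed below are file-1-only
while ($i < @sortedf1 and $j < $n2){
    my $key1 = $sortedf1[$i][$position];
    my $key2 = $sortedf2[$j][$position];
    if ($key1 == $key2){
        foreach(0..$hsize){ #header size.
            my $field = $sortedf1[$i][$_];
            $sortedf2[$j][$_] = $field if defined $field and $field ne '';
        }
        $i++;
        $j++;
    }
    elsif ( $key1 < $key2){
        push(@sortedf2,[@{$sortedf1[$i]}]);   # key exists only in file 1
        $i++;
    }
    else {
        $j++;                                 # key exists only in file 2
    }
}
# file 1 rows left over once file 2 is exhausted
push @sortedf2, @sortedf1[$i .. $#sortedf1];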

#foreach my $line (@sortedf2){print "\t [ @$line ],\n"; }

print "outputting to file\n";
open OUT, ">$out";
print OUT $header;
foreach(@sortedf2){
    print OUT (join ",", @{$_})."\n";
}
close OUT;

}

Thanks everyone, the solution is posted above. Merging the whole thing now takes about 1 minute! :)

Answer 1

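Two techniques come to mind.

1. Read the data from the CSV files into two tables in a DBMS (SQLite would work just fine), then use the DB to do the join and write the data back out to CSV. The database will use indexes to optimise the join.
2. First, sort each file by primary key (using either perl or unix sort), then do a linear scan over the two files in parallel (read a record from each file; if the keys are equal, output a joined row and advance both files; if the keys are unequal, advance the file with the smaller key and try again). This step takes O(n + m) time instead of O(n * m), and O(1) memory.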
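
The linear scan in the second technique can be sketched in a few lines. This is only an illustration on small in-memory arrays, not the answerer's code: merge_sorted is a hypothetical helper, rows are split on plain commas (no quoted fields), keys are assumed to be numeric, and unlike a strict join it also keeps rows whose key appears in only one input, since that is what the question asks for.

```perl
use strict;
use warnings;

# Merge two lists of rows (array refs), each already sorted numerically by
# the key column. Returns a reference to the merged rows, in key order.
sub merge_sorted {
    my ($rows1, $rows2, $keypos) = @_;
    my ($i, $j) = (0, 0);
    my @out;
    while ($i < @$rows1 and $j < @$rows2) {
        my ($r1, $r2) = ($rows1->[$i], $rows2->[$j]);
        my $cmp = $r1->[$keypos] <=> $r2->[$keypos];
        if ($cmp == 0) {
            # Same key: take each field from row 1, falling back to row 2.
            my $last = $#$r1 > $#$r2 ? $#$r1 : $#$r2;
            push @out, [ map {
                my $f = $r1->[$_];
                (defined($f) && length($f)) ? $f
                    : defined($r2->[$_])    ? $r2->[$_]
                    :                         '';
            } 0 .. $last ];
            $i++;
            $j++;
        }
        elsif ($cmp < 0) { push @out, $r1; $i++ }   # key only in input 1
        else             { push @out, $r2; $j++ }   # key only in input 2
    }
    # One input is exhausted; whatever remains of the other is unmatched.
    push @out, @{$rows1}[$i .. $#$rows1], @{$rows2}[$j .. $#$rows2];
    return \@out;
}

# Tiny demo: the limit of -1 keeps trailing empty fields when splitting.
my @sorted1 = map { [ split /,/, $_, -1 ] } "x,,,,7", "one,two,,four,42";
my @sorted2 = map { [ split /,/, $_, -1 ] } "one,,three,,42", "y,,,,99";
print join(',', @$_), "\n" for @{ merge_sorted(\@sorted1, \@sorted2, 4) };
```

Reading one record at a time from two sorted filehandles, instead of indexing into arrays, is what brings the memory use down to a constant.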

Answer 2

What is killing the performance is this code, which concatenates millions of times.

$OUTSTRING.=$outstring."\n";

....

foreach my $key (sort { $a <=> $b } keys %line_for){
    $OUTSTRING.= $file1array[$line_for{$key}]."\n";
}

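If you want to write to the output file only once, accumulate your results in an array and then print them at the very end using join. Or, perhaps even better, include the newlines in the results and write the array directly.

To see how concatenation fails to scale when processing big data, experiment with this demo script. When you run it in concat mode, things start slowing down considerably after a couple of hundred thousand concatenations; I gave up and killed the script. By contrast, simply printing an array of a million lines took less than a minute on my machine.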

# Usage: perl demo.pl 50 999999 concat|join|direct
use strict;
use warnings;

my ($line_len, $n_lines, $method) = @ARGV;
my @data = map { '_' x $line_len . "\n" } 1 .. $n_lines;

open my $fh, '>', 'output.txt' or die $!;

if ($method eq 'concat'){         # Dog slow. Gets slower as @data gets big.
    my $outstring;
    for my $i (0 .. $#data){
        print STDERR $i, "\n" if $i % 1000 == 0;
        $outstring .= $data[$i];
    }
    print $fh $outstring;
}
elsif ($method eq 'join'){        # Fast
    print $fh join('', @data);
}
else {                            # Fast
    print $fh @data;
}

Answer 3

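I can't see anything that strikes me as obviously slow, but I would make these changes:

First, I would eliminate the @file1array variable. You don't need it; just store the line itself in the hash: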
```
while (<FILE1>){
     chomp;
     $line_for{read_csv_string($_,$position)}=$_;
}
```
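Second, although this shouldn't really make much difference with perl, I wouldn't keep appending to $OUTSTRING. Instead, keep an array of output lines and push onto it each time. If for some reason you still need to call write_line with one massive string, you can always use join('', @OUTLINES) at the end.
If write_line doesn't use syswrite or something similarly low-level, but instead uses print or other stdio-based calls, then you aren't saving any disk writes by building up the output file in memory. In that case you might as well not build the output up in memory at all, and just write it out as you create it. Of course, if you are using syswrite, forget this.
Since nothing is obviously slow, try throwing Devel::SmallProf at your code. I've found it to be the best perl profiler for producing those "Oh! That's the slow line!" insights.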
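
Put together, the first three suggestions might look roughly like the sketch below. It makes several assumptions that are not in the question: merge_handles is a hypothetical name, the sub takes already-open filehandles rather than file names, the columns to copy are passed in as an array reference, and a plain split on commas stands in for the question's read_csv_string helper (so quoted fields are not handled).

```perl
use strict;
use warnings;

# $in1, $in2 and $out are open filehandles; @$cols lists the file-1 columns
# that overwrite the same columns of the matching file-2 row.
sub merge_handles {
    my ($in1, $in2, $out, $position, $cols) = @_;

    my %line_for;                        # primary key => whole line of file 1
    while (my $line = <$in1>) {
        chomp $line;
        my $key = (split /,/, $line, -1)[$position];
        $line_for{$key} = $line;
    }

    while (my $line = <$in2>) {
        chomp $line;
        my @row2 = split /,/, $line, -1;
        # delete returns the stored line, so lookup and removal are one step
        my $line1 = delete $line_for{ $row2[$position] };
        if (defined $line1) {
            my @row1 = split /,/, $line1, -1;
            $row2[$_] = $row1[$_] for @$cols;
        }
        print {$out} join(',', @row2), "\n";    # written as soon as it is built
    }

    # Keys that only ever appeared in file 1.
    print {$out} "$line_for{$_}\n" for sort { $a <=> $b } keys %line_for;
}

# Demo using in-memory filehandles in place of real files.
my $data1  = "one,two,,four,42\nonly1,a,b,c,7\n";
my $data2  = "one,,three,,42\nonly2,,,,99\n";
my $merged = '';
open my $in1, '<', \$data1  or die $!;
open my $in2, '<', \$data2  or die $!;
open my $out, '>', \$merged or die $!;
merge_handles($in1, $in2, $out, 4, [1, 3]);
close $out;
print $merged;
```

Because print is buffered, writing each line as it is produced does not turn into one disk write per line.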

Answer 4

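If you want to merge, you should really merge. First you have to sort your data by key, and then merge! You will beat even MySQL in performance. I have a lot of experience with this.

You can write something along these lines: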

#!/usr/bin/env perl
use strict;
use warnings;

use Text::CSV_XS;
use autodie;

use constant KEYPOS => 4;

die "Insufficient number of parameters" if @ARGV < 2;
my $csv = Text::CSV_XS->new( { eol => $/ } );
my $sortpos = KEYPOS + 1;
open my $file1, "sort -n -k$sortpos -t, $ARGV[0] |";
open my $file2, "sort -n -k$sortpos -t, $ARGV[1] |";
my $row1 = $csv->getline($file1);
my $row2 = $csv->getline($file2);
while ( $row1 and $row2 ) {
    my $row;
    if ( $row1->[KEYPOS] == $row2->[KEYPOS] ) {    # merge rows
        $row  = [ map { $row1->[$_] || $row2->[$_] } 0 .. $#$row1 ];
        $row1 = $csv->getline($file1);
        $row2 = $csv->getline($file2);
    }
    elsif ( $row1->[KEYPOS] < $row2->[KEYPOS] ) {
        $row  = $row1;
        $row1 = $csv->getline($file1);
    }
    else {
        $row  = $row2;
        $row2 = $csv->getline($file2);
    }
    $csv->print( *STDOUT, $row );
}

# flush possible tail
while ( $row1 ) {
    $csv->print( *STDOUT, $row1 );
    $row1 = $csv->getline($file1);
}
while ( $row2 ) {
    $csv->print( *STDOUT, $row2 );
    $row2 = $csv->getline($file2);    # read the tail of file 2, not file 1
}
close $file1;
close $file2;

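Redirect the output to a file and measure.

If you would rather be safer about how the sort arguments are passed, you can replace the file-opening part with: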

(open my $file1, '-|') || exec('sort',  '-n',  "-k$sortpos",  '-t,',  $ARGV[0]);
(open my $file2, '-|') || exec('sort',  '-n',  "-k$sortpos",  '-t,',  $ARGV[1]);

Answer 5

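Assuming about 20 bytes per line, each of your files comes to about 20 MB, which is not too big. Since you are using a hash, time complexity doesn't seem to be the problem.

In your second loop you are printing to the console for each line, and that bit is slow. Try removing it; that should help a lot. You can also avoid the delete in the second loop.

Reading multiple lines at a time should also help, but not by much, I think; there is always going to be a read-ahead behind the scenes.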

Answer 6

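I would store each record in a hash whose keys are the primary keys. The value for a given primary key is a reference to an array of CSV values, where undef represents an unknown value.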

use 5.10.0;  # for // ("defined-or")
use Carp;
use Text::CSV;

sub merge_csv {
  my($path,$record) = @_;

  open my $fh, "<", $path or croak "$0: open $path: $!";

  my $csv = Text::CSV->new;
  local $_;
  while (<$fh>) {
    if ($csv->parse($_)) {
      my @f = map length($_) ? $_ : undef, $csv->fields;
      next unless @f >= 1;

      my $primary = pop @f;
      if ($record->{$primary}) {
        $record->{$primary}[$_] //= $f[$_]
          for 0 .. $#{ $record->{$primary} };
      }
      else {
        $record->{$primary} = \@f;
      }
    }
    else {
      warn "$0: $path:$.: parse failed; skipping...\n";
      next;
    }
  }
}

Your main program would resemble

my %rec;
merge_csv $_, \%rec for qw/ file1 file2 /;

The Data::Dumper module shows that the resulting hash, given the simple input from your question, is

$VAR1 = {
  '42' => [
    'one',
    'two',
    'three',
    'four'
  ]
};
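
Turning a hash of that shape back into CSV lines could be sketched as follows. This step is not part of the answer above: records_to_lines is a hypothetical helper, the primary key is written back as the last column because merge_csv pops it from the end, keys are assumed to be numeric, and a plain join is used, so fields that contain commas or quotes would need a CSV writer such as Text::CSV instead.

```perl
use strict;
use warnings;

# Same shape as the dump above: primary key => array ref of fields,
# with undef standing for an unknown value.
my %rec = (
    42 => [ 'one', 'two', 'three', 'four' ],
    7  => [ 'x', undef, undef, 'y' ],
);

# Returns one CSV line per record, ordered by numeric primary key.
sub records_to_lines {
    my ($record) = @_;
    return map {
        my $key = $_;
        join ',', ( map { defined $_ ? $_ : '' } @{ $record->{$key} } ), $key;
    } sort { $a <=> $b } keys %$record;
}

print "$_\n" for records_to_lines(\%rec);
```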

Perl: merge 2 CSV files line by line using a primary key
