Question

What is the best way to summarize data from a file that has around 2 million records in Perl?

For example, a file like this:

ABC|XYZ|DEF|EGH|100
ABC|XYZ|DEF|FGH|200
SDF|GHT|WWW|RTY|1000
SDF|GHT|WWW|TYU|2000

needs to be summarized on the first 3 columns, like this:

ABC|XYZ|DEF|300
SDF|GHT|WWW|3000

Chris

Answer 1

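Assuming there are always five columns, the fifth of which is numeric, and you always want the first three columns to be the key...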

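use warnings;
use strict;

my %totals_hash;

# Read records from the files named on the command line (or from STDIN).
while (<>)
{
  chomp;
  my @cols = split /\|/;

  # The first three columns form the grouping key.
  my $key = join '|', @cols[0..2];

  # Accumulate the fifth column under that key.
  $totals_hash{$key} += $cols[4];
}

# Print one summary line per key, in sorted key order.
foreach (sort keys %totals_hash)
{
  print $_, '|', $totals_hash{$_}, "\n";
}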
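The assumptions stated above (five columns, numeric fifth column) can also be checked while reading, so that a malformed line is counted rather than silently added as zero. The following is only a minimal sketch of that idea: `summarize_lines` is a hypothetical helper name, and skipping bad lines (instead of dying) is an assumption about what you would want.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# summarize_lines is a hypothetical helper, not part of the answer above.
# It returns (\%totals, $skipped): totals keyed by the first three columns,
# and a count of lines that did not have five columns with a numeric fifth.
sub summarize_lines {
    my @lines = @_;
    my %totals;
    my $skipped = 0;
    for my $line (@lines) {
        chomp $line;
        my @cols = split /\|/, $line, -1;   # -1 keeps trailing empty fields
        unless (@cols == 5 && looks_like_number($cols[4])) {
            $skipped++;
            next;
        }
        $totals{ join '|', @cols[0..2] } += $cols[4];
    }
    return (\%totals, $skipped);
}

my ($totals, $skipped) = summarize_lines(
    "ABC|XYZ|DEF|EGH|100\n",
    "ABC|XYZ|DEF|FGH|200\n",
    "SDF|GHT|WWW|RTY\n",          # malformed: only four columns
);
print "$_|$totals->{$_}\n" for sort keys %$totals;
print STDERR "skipped $skipped malformed line(s)\n" if $skipped;
```

For a 2-million-line file you would feed the lines from a `while (<>)` loop instead of passing them as a list; the list form here just keeps the sketch short.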

Answer 2

You can use a hash like this:

my %hash;
while (<DATA>) {
        chomp;
        my @tmp = split/\|/;     # split each line on |
        my $value = pop @tmp;    # last ele is the value
        pop @tmp;                # pop unwanted entry
        my $key = join '|',@tmp; # join the remaining ele to form key

        $hash{$key} += $value;   # add value for this key
}

# print hash key-values.
for(sort keys %hash) {
        print $_ . '|'.$hash{$_}."\n";
}


Answer 3

Assuming your input file has one record per line:

perl -n -e 'chomp;@a=split/\|/;$h{join"|",splice@a,0,3}+=pop@a;END{print map{"$_: $h{$_}\n"}keys%h}' < inputfile

Answer 4

1-2-3-4 I declare a CODE-GOLF WAR!!! (OK, a reasonably readable code-golf dust-up.)

my %sums;
m/([^|]+\|[^|]+\|[^|]+).*?\|(\d+)/ and $sums{ $1 } += $2 while <>;
print join( "\n", ( map { "$_|$sums{$_}" } sort keys %sums ), '' );

Answer 5

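Sort so that all records with the same first three fields end up next to each other. Then iterate, and emit a subtotal each time a different triplet appears.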

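use strict;
use warnings;

# Assumption: the input and output paths are passed on the command line.
my ($inFile, $outFile) = @ARGV;

open(my $in, '<', $inFile) or die "Cannot open $inFile: $!";
my @sorted = sort <$in>;   # sorting puts records with the same key next to each other
close($in);

open(my $out, '>', $outFile) or die "Cannot open $outFile: $!";
my $prevKey;
my $subtotal = 0;
foreach my $line (@sorted) {
    chomp $line;
    my @parts = split(/\|/, $line);
    my $value = pop(@parts);
    my $key = join('|', @parts[0..2]);
    if (defined $prevKey && $key ne $prevKey) {
        print $out "$prevKey|$subtotal\n";   # key changed: emit the finished group
        $subtotal = 0;
    }
    $prevKey = $key;
    $subtotal += $value;
}
print $out "$prevKey|$subtotal\n" if defined $prevKey;   # emit the last group
close($out);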

If your box chokes on 2 million records, you may have to write each record to a file based on its group, and then subtotal each file.
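A rough sketch of that fallback follows. It is a variation rather than a literal reading of the suggestion: instead of one file per group (which could mean a very large number of open files), it hashes each key into a fixed number of bucket files, so all records of a group still land in the same file. The helper name `bucket_summarize`, the default of 16 buckets, and the use of a temporary directory are all assumptions made for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# bucket_summarize is a hypothetical helper. $buckets (default 16) is an
# arbitrary choice; the bucket files live in a temp dir removed at exit.
sub bucket_summarize {
    my ($in_fh, $out_fh, $buckets) = @_;
    $buckets ||= 16;
    my $dir = tempdir(CLEANUP => 1);

    # Pass 1: route each record to a bucket file chosen from its key, so
    # every record of a given group ends up in the same file.
    my @bucket_fh;
    for my $i (0 .. $buckets - 1) {
        open($bucket_fh[$i], '+>', "$dir/bucket$i")
            or die "Cannot open bucket $i: $!";
    }
    while (my $line = <$in_fh>) {
        chomp $line;
        my @cols = split /\|/, $line;
        next unless @cols >= 5;                     # ignore short lines
        my $key = join '|', @cols[0..2];
        my $i = unpack('%32C*', $key) % $buckets;   # simple byte checksum
        print { $bucket_fh[$i] } "$key|$cols[4]\n";
    }

    # Pass 2: subtotal one bucket at a time, so only that bucket's keys
    # are held in memory.
    for my $fh (@bucket_fh) {
        seek($fh, 0, 0) or die "seek failed: $!";
        my %totals;
        while (my $line = <$fh>) {
            chomp $line;
            my $sep = rindex($line, '|');           # value follows the last pipe
            $totals{ substr($line, 0, $sep) } += substr($line, $sep + 1);
        }
        print {$out_fh} "$_|$totals{$_}\n" for sort keys %totals;
        close($fh);
    }
}

# Usage with an in-memory sample; a real run would pass a handle on the data file.
my $sample = "ABC|XYZ|DEF|EGH|100\nABC|XYZ|DEF|FGH|200\n"
           . "SDF|GHT|WWW|RTY|1000\nSDF|GHT|WWW|TYU|2000\n";
open(my $in, '<', \$sample) or die "Cannot open in-memory sample: $!";
bucket_summarize($in, \*STDOUT, 4);
```

Output is sorted within each bucket but not across buckets, so pipe it through `sort` if a globally ordered report matters.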

Perl - Summarizing data in a file
