Using Perl to calculate averages in a CSV file, collecting numbers while moving up the file

Time: 2015-11-12 00:20:04

Tags: perl csv average

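I have a CSV file with three columns named Mb_size, tax_id and parent_id. There is a relationship between tax_id and parent_id: for example, in the row near the end of the file whose Mb_size is 22.2220658537, 5820 is the tax_id and 5819 is the parent_id. Moving up the file, 5819 then appears in the tax_id column, next to its own parent_id. A parent_id can be repeated, but each tax_id is unique in its column.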

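Starting from the end of the file, where Mb_size has values, I need to calculate an average for each parent_id from the values of the rows that name it as their parent, and put that average in the row where that parent_id appears as a tax_id. Then, moving up, the parent_id next to that tax_id becomes the new starting point, and the same step repeats up the file.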

Here is the sample input:

Mb_size,tax_id,parent_id
,1,1
,131567,1
,2759,131567
,5819,2759
,147429,2759
22.2220658537,5820,5819
184.801317,4557,147429
748.66869,4575,147429
555.55,1234,5819

Here is the sample output:

 Mb_size,tax_id,parent_id
 377.810518214,1,1
 377.810518214,131567,1
 377.810518214,2759,131567
 288.886032927,5819,2759
 466.7350035,147429,2759
 22.2220658537,5820,5819
 184.801317,4557,147429
 748.66869,4575,147429
 555.55,1234,5819
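
As a check on the sample output, the averages can be reproduced by hand. The sketch below is only an arithmetic illustration: the helper name `mean` is hypothetical, and the leaf sizes are copied from the sample rows. Each parent takes the mean of its direct children's values.

```perl
use strict;
use warnings;

# Mean of a non-empty list of numbers (hypothetical helper for this check).
sub mean {
    my @values = @_;
    my $sum = 0;
    $sum += $_ for @values;
    return $sum / @values;    # @values in numeric context is the element count
}

# Leaf sizes copied from the sample input.
my $avg_5819   = mean(22.2220658537, 555.55);     # children 5820 and 1234
my $avg_147429 = mean(184.801317, 748.66869);     # children 4557 and 4575
my $avg_2759   = mean($avg_5819, $avg_147429);    # children 5819 and 147429

printf "%.4f %.4f %.4f\n", $avg_5819, $avg_147429, $avg_2759;
# prints: 288.8860 466.7350 377.8105
```

Rows 131567 and 1 each have a single child chain leading to 2759, so they carry the same value, 377.8105.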

Code so far

 use strict;
 use warnings;
 no warnings 'numeric';

  open taxa_fh, '<', "$ARGV[0]" or die qq{Failed to open "$ARGV[0]" for input: $!\n};
  open match_fh, ">$ARGV[0]_sized.csv" or die qq{Failed to open for output: $!\n};

  my %data;

  while ( my $line = <taxa_fh> ) {

  chomp( $line );

    my @fields    = split( /,/, $line );
    my $Mb_size   = $fields[0];
    my $tax_id    = $fields[1];
    my $parent_id = $fields[2];

    $data{$parent_id}{sum} += $Mb_size;
    $data{$parent_id}{count}++;
   }

    for my $parent_id ( sort keys %data ) {
    my $avg = $data{$parent_id}{sum} / $data{$parent_id}{count};
    print match_fh "$parent_id, $avg \n";

    }

   close taxa_fh;
   close match_fh;

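The code I have so far came from a poster who helped earlier. I have edited the question to make it better and clearer. I cannot get the code to keep calculating up the file, nor to include the original lines from below in the printed output. I tried a foreach over tax_id, but that did not work. Any suggestions for completing this task are welcome. It does move up the file, but it does not do the calculation.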

2 answers:

Answer 0 (score: 1)

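You first need to build the data structure carefully, from start to finish. I am using hashes.

Here, for each parent_id as the key, I build a hash in which I store the averages, tax_id, sum and count.

Since several tax_ids may be associated with a single parent_id, we need to store an average for each of them separately.

Once the data is in this tree-like structure, printing it the way we need becomes trivial. Because these are hashes, the order is not preserved. To maintain the order, you can use arrays instead of hashes.

One approach is as follows: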

#!/usr/bin/perl
use strict;
use warnings;

open my $fh, '<', 'tax' or die "unable to open file:$!\n";

my %data;
my @lines;
chomp(my $header = <$fh>);    # read the header line
while (<$fh>) {
    chomp;
    my @fields = split(/,/);
    if ($fields[0]) {
        # row already has a size: store it as this node's avg under its parent
        $data{$fields[2]}{$fields[1]}{avg} = $fields[0];
        $data{$fields[2]}{sum} += $fields[0];
        $data{$fields[2]}{count}++;
    }
    else {
        # row without a size: keep it so it can be filled in later
        push(@lines, [@fields]);
    }
}
close($fh);

# walk the unfilled rows bottom-up so children are averaged before parents
@lines = reverse @lines;
foreach my $line (@lines) {
    my ($tax_id, $parent_id) = @{$line}[1, 2];
    if (exists $data{$tax_id}) {
        $data{$parent_id}{$tax_id}{avg} = $data{$tax_id}{sum} / $data{$tax_id}{count};
        $data{$parent_id}{sum} += $data{$parent_id}{$tax_id}{avg};
        $data{$parent_id}{count}++;
    }
    else {
        print "Sorry No Such Entry ", $parent_id, " present\n";
    }
}

print "$header\n";
# outer keys are parent_ids; the hash-valued inner keys are their child tax_ids
foreach my $parent_id (keys %data) {
    foreach my $tax_id (keys %{ $data{$parent_id} }) {
        if (ref($data{$parent_id}{$tax_id}) eq 'HASH') {
            print $data{$parent_id}{$tax_id}{avg} . "," . $tax_id . "," . $parent_id . "\n";
        }
    }
}
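
The answer notes that hashes do not preserve order and suggests arrays for that. A minimal sketch of that idea follows; it is not the answer's own code. The name `fill_averages` is hypothetical, the sample data is inlined instead of read from a file, and it assumes every child row appears after its parent row, as in the sample, so a single bottom-up pass is enough.

```perl
use strict;
use warnings;

# Returns copies of the rows ([size, tax_id, parent_id]) in input order,
# with each empty size replaced by the mean of its direct children's values.
sub fill_averages {
    my @rows = map { [ @$_ ] } @_;    # copy so the caller's rows are untouched
    my (%sum, %count);

    for my $row (reverse @rows) {
        my ($size, $tax_id, $parent_id) = @$row;

        # A row without a size takes the mean of its children's values.
        if (!defined $size || $size eq '') {
            next unless $count{$tax_id};    # no children with data: leave empty
            $size = $row->[0] = $sum{$tax_id} / $count{$tax_id};
        }

        # Pass the value up, except for the root, which is its own parent.
        next if $tax_id eq $parent_id;
        $sum{$parent_id}   += $size;
        $count{$parent_id} += 1;
    }
    return @rows;
}

my @lines = split /\n/, <<'CSV';
Mb_size,tax_id,parent_id
,1,1
,131567,1
,2759,131567
,5819,2759
,147429,2759
22.2220658537,5820,5819
184.801317,4557,147429
748.66869,4575,147429
555.55,1234,5819
CSV

# Limit -1 keeps empty fields; the first row is the column header.
my ($header, @input) = map { [ split /,/, $_, -1 ] } @lines;

print join(',', @$header), "\n";
print join(',', @$_), "\n" for fill_averages(@input);
```

Because the rows live in an array, the output keeps the input order, and the hashes are only used as lookup tables for the running sums and counts.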

Answer 1 (score: 1)

Based on your work, here is another, similar solution:

use strict;
use warnings;

open taxa_fh, '<', "$ARGV[0]" or die qq{Failed to open "$ARGV[0]" for input: $!\n};
open match_fh, ">$ARGV[0]_sized.csv" or die qq{Failed to open for output: $!\n};

my %node_data;
my %parent;
my @node_order;
my $header;
while ( my $line = <taxa_fh> ) {
    chomp( $line );

    if (1 == $.) {
        $header = $line;
        next; # Skip header
    }

    my @fields    = split( /,/, $line );
    my $Mb_size   = $fields[0] || 0; # To avoid uninitialized warning
    my $tax_id    = $fields[1];
    my $parent_id = $fields[2];

    $parent{$tax_id} = $parent_id;
    push @node_order, $tax_id;
    $node_data{$tax_id} = $Mb_size;
}

# Add the node value for all parents in the tree
my %totals;
for my $tax_id ( sort keys %parent ) {
    my $parent = $parent{$tax_id};
    my $done = 0;
    while( ! $done ) {
        if ($node_data{$tax_id} > 0) {
            $totals{$parent}->{sum} += $node_data{$tax_id};
            $totals{$parent}->{count}++;
        }
        $done++ if ($parent{$parent} == $parent);
        $parent = $parent{$parent};
    }
}

print match_fh "$header\n";
for my $id ( @node_order ) {
    my $avg;
    if ( exists $totals{$id} ) {
        # Parent Node
        $avg = $totals{$id}->{sum} / $totals{$id}->{count};        
    } else {
        # Leaf Node
        $avg = $node_data{$id};
    }

    print match_fh "$avg, $id, " . $parent{$id} . "\n";
}

close taxa_fh;
close match_fh;

Output:

Mb_size,tax_id,parent_id
377.810518213425, 1, 1
377.810518213425, 131567, 1
377.810518213425, 2759, 131567
288.88603292685, 5819, 2759
466.7350035, 147429, 2759
22.2220658537, 5820, 5819
184.801317, 4557, 147429
748.66869, 4575, 147429
555.55, 1234, 5819
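
The core of the solution above is the walk from a node up through its parents until it reaches the root, which is its own parent. That step can be isolated as a small function; the sketch below is an illustration rather than part of the answer. The name `ancestors` is hypothetical, and the extra checks for a missing parent or a repeated ID are assumptions added so the loop always terminates on malformed input.

```perl
use strict;
use warnings;

# Returns the chain of ancestors of $tax_id, nearest first, ending at the root.
# Stops at a node that is its own parent, at an ID with no known parent,
# or when an ID repeats (a cycle).
sub ancestors {
    my ($tax_id, $parent_of) = @_;
    my (@chain, %seen);
    my $current = $tax_id;
    $seen{$current} = 1;
    while (defined(my $next = $parent_of->{$current})) {
        last if $seen{$next}++;    # root (own parent) or cycle
        push @chain, $next;
        $current = $next;
    }
    return @chain;
}

# tax_id => parent_id, copied from the sample input.
my %parent = (
    1    => 1,    131567 => 1,      2759 => 131567,
    5819 => 2759, 147429 => 2759,   5820 => 5819,
    4557 => 147429, 4575 => 147429, 1234 => 5819,
);

print join(' -> ', 5820, ancestors(5820, \%parent)), "\n";
# prints: 5820 -> 5819 -> 2759 -> 131567 -> 1
```

Adding a leaf's Mb_size to the sum and count of every ID in this chain gives the per-node totals that the solution divides to get each average.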