Sum rows across columns by matching values in the second column

Time: 2018-02-12 17:45:34

Tags: shell perl awk sed

I have a file like this:

ID  Category Sample1 Sample2 Sample3
1   A        5       5       5
2   A        5       5       5
3   A        5       5       5
4   B        1       2       3
5   B        1       2       3

I am looking for an awk, sed, or similar solution that produces this:

ID  Category Sample1 Sample2 Sample3
1   A        15      15      15
4   B        2       4       6

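The idea is to sum the rows that share a category, separately for each sample column, and to drop the duplicate IDs so that only the first ID of each category remains.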

1 answer:

Answer 0: (score: 1)

data1.txt:

ID  Category Sample1 Sample2 Sample3
1   A        5       5       5
2   A        5       5       5
3   A        5       5       5
4   B        1       2       3
5   B        1       2       3

Code:

use strict;
use warnings; 
use 5.020;
use autodie;
use Data::Dumper;
use List::MoreUtils qw{ uniq };

open my $INFILE, '<', 'data1.txt';

my %results;
my @cat_order;

#Get names of column headers:
my ($h1, $h2, @sample_names) =  split ' ', <$INFILE>;

while (my $line = <$INFILE>) {
    my($id, $cat, @samples)= split ' ', $line;
    push @cat_order, $cat;

    push @{ $results{$cat}{$h1} }, $id;  #e.g. push results{A}{ID}, 1 

    while ( my($i, $sample) = each @samples ) {
        $results{$cat}{ $sample_names[$i] } += $sample;  #e.g. results{A}{Sample1} += 5   
    }
}

close $INFILE;
open my $OUTFILE, '>', 'results.txt';

my $format = "%-3s %-9s %-9s %-9s %-9s\n";
printf {$OUTFILE} $format, $h1, $h2, @sample_names;

for my $cat (uniq(@cat_order)) {
    printf {$OUTFILE} $format,
        $results{$cat}{$h1}[0],  #e.g. results{A}{ID}[0], the first id stored in the ID array
        $cat,  #e.g. A
        @{ $results{$cat} }{@sample_names};  #e.g. results{A}{'Sample1', 'Sample2', 'Sample3'} -- a hash slice,
                                             #which returns the list of values for those keys.
}

close $OUTFILE;

Output:

$ rm results.txt
remove results.txt? y

$ perl 1.pl 

$ cat results.txt 
ID  Category  Sample1   Sample2   Sample3  
1   A         15        15        15       
4   B         2         4         6  

With this data1.txt:

ID  Category Sample1 Sample2 Sample3
1   A        5       5       5
2   B        1       2       3
3   A        5       5       5
4   C        10      11      12
5   B        1       2       3
6   A        5       5       5
7   C        1       1       1

Output:

$ rm results.txt
remove results.txt? y

$ perl 1.pl 
$ cat results.txt 
ID  Category  Sample1   Sample2   Sample3  
1   A         15        15        15       
2   B         2         4         6        
4   C         11        12        13  

%results for that input (dumped with Data::Dumper):

$VAR1 = {
          'C' => {
                   'ID' => [
                             '4',
                             '7'
                           ],
                   'Sample1' => 11,
                   'Sample3' => 13,
                   'Sample2' => 12
                 },
          'B' => {
                   'ID' => [
                             '2',
                             '5'
                           ],
                   'Sample1' => 2,
                   'Sample3' => 6,
                   'Sample2' => 4
                 },
          'A' => {
                   'Sample1' => 15,
                   'ID' => [
                             '1',
                             '3',
                             '6'
                           ],
                   'Sample2' => 15,
                   'Sample3' => 15
                 }
        };
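
Since the question asked for awk, here is a rough awk sketch of the same approach, not a tested drop-in replacement for the Perl script. The function name sum_by_category is made up for illustration. It assumes whitespace-separated columns, a single header line, and numeric sample values. It keeps the first ID seen for each category and prints categories in order of first appearance. Fields come out separated by single spaces, so the column alignment of the input is not preserved.

```shell
# Hypothetical helper (name chosen for illustration): sum sample columns per category.
sum_by_category() {
    awk '
    NR == 1 { $1 = $1; print; next }   # header: normalise spacing, pass through
    NF == 0 { next }                   # skip blank lines
    !($2 in first_id) {                # first time this category is seen
        first_id[$2] = $1              # remember its first ID
        order[++n] = $2                # remember order of first appearance
    }
    {
        for (i = 3; i <= NF; i++)      # add every sample column to its running sum
            sum[$2, i] += $i
        if (NF > ncols) ncols = NF
    }
    END {
        for (k = 1; k <= n; k++) {
            cat = order[k]
            line = first_id[cat] " " cat
            for (i = 3; i <= ncols; i++)
                line = line " " sum[cat, i]
            print line
        }
    }' "$1"
}
```

It could be called as `sum_by_category data1.txt > results.txt`. If aligned columns matter and the `column` utility is available, piping the result through `column -t` should line them up.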