我有一个这样的文件:
ID Category Sample1 Sample2 Sample3
1 A 5 5 5
2 A 5 5 5
3 A 5 5 5
4 B 1 2 3
5 B 1 2 3
我正在寻找一个awk,sed或类似的解决方案来实现这一目标:
ID Category Sample1 Sample2 Sample3
1 A 15 15 15
4 B 2 4 6
我们的想法是对匹配类别的行进行求和,考虑每个样本列的值并删除重复的ID。
答案 0 :(得分:1)
DATA1.TXT:
ID Category Sample1 Sample2 Sample3
1 A 5 5 5
2 A 5 5 5
3 A 5 5 5
4 B 1 2 3
5 B 1 2 3
代码:
use strict;
use warnings;
use 5.020;
use autodie;
use Data::Dumper;
use List::MoreUtils qw{ uniq };
open my $INFILE, '<', 'data1.txt';
my %results;
my @cat_order;
#Get names of column headers:
my ($h1, $h2, @sample_names) = split ' ', <$INFILE>;
while (my $line = <$INFILE>) {
my($id, $cat, @samples)= split ' ', $line;
push @cat_order, $cat;
push @{ $results{$cat}{$h1} }, $id; #e.g. push results{A}{ID}, 1
while ( my($i, $sample) = each @samples ) {
$results{$cat}{ $sample_names[$i] } += $sample; #e.g. results{A}{Sample1} += 5
}
}
close $INFILE;
open my $OUTFILE, '>', 'results.txt';
my $format = "%-3s %-9s %-9s %-9s %-9s\n";
printf {$OUTFILE} $format, $h1, $h2, @sample_names;
for my $cat (uniq(@cat_order)) {
printf( $OUTFILE
$format,
$results{$cat}{$h1}[0], #e.g. results{A}{ID}[0], which is the first id in the ID array
$cat, #e.g. A
@{ $results{$cat} }{@sample_names} #e.g. results{A}{'Sample1', 'Sample2', 'Sample3'} -- a hash slice,
#which returns an array of the values matching those keys.
)
}
close $OUTFILE;
输出:
$ rm results.txt
remove results.txt? y
$ perl 1.pl
$ cat results.txt
ID Category Sample1 Sample2 Sample3
1 A 15 15 15
4 B 2 4 6
使用data1.txt:
ID Category Sample1 Sample2 Sample3
1 A 5 5 5
2 B 1 2 3
3 A 5 5 5
4 C 10 11 12
5 B 1 2 3
6 A 5 5 5
7 C 1 1 1
输出:
$ rm results.txt
remove results.txt? y
$ perl 1.pl
$ cat results.txt
ID Category Sample1 Sample2 Sample3
1 A 15 15 15
2 B 2 4 6
4 C 11 12 13
%results
:
$VAR1 = {
'C' => {
'ID' => [
'4',
'7'
],
'Sample1' => 11,
'Sample3' => 13,
'Sample2' => 12
},
'B' => {
'ID' => [
'2',
'5'
],
'Sample1' => 2,
'Sample3' => 6,
'Sample2' => 4
},
'A' => {
'Sample1' => 15,
'ID' => [
'1',
'3',
'6'
],
'Sample2' => 15,
'Sample3' => 15
}
};