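My coding problem exceeds my limited skills with Unix power tools. I want to count the number of samples that have either: i) a homozygous variant in a gene (BB below); or ii) two variants in a gene (2x AB). For example, from: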
Variant Gene Sample1 Sample2 Sample3
1 TP53 AA BB AB
2 TP53 AB AA AB
3 TP53 AB AA AA
4 KRAS AA AB AA
5 KRAS AB AB BB
I am looking for:
Gene Two_variants Homozygous Either
TP53 2 1 3
KRAS 1 1 2
Any help is greatly appreciated. Thanks.
R_G
Answer 0: (score: 1)
In GNU awk:
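awk '
    # \< and \> are GNU awk word-boundary operators.
    # Line with at least two AB calls: count it under the gene in column 2.
    /\<AB\>.+\<AB\>/ { arr[$2,"AB"] += 1 }
    # Line with at least one BB call.
    /\<BB\>/         { arr[$2,"BB"] += 1 }
    END {
        # Recover the distinct gene names from the composite array keys.
        for ( elt in arr ) {
            split ( elt, index_parts, SUBSEP )
            genes[index_parts[1]] = 0
        }
        printf "%4s%13s%11s%7s\n", "Gene", "Two_variants", "Homozygous", "Either"
        for ( gene in genes ) {
            printf "%4s%6d%13d%9d\n", gene, arr[gene,"AB"], arr[gene,"BB"], arr[gene,"AB"] + arr[gene,"BB"]
        }
    }' input.txt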
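As far as I can tell, the patterns above test each line on its own, so two AB calls are only detected when they sit in different samples of the same variant. If the goal is two AB calls in the same sample across the variants of one gene, the tallies have to be kept per gene and per sample column. Below is a minimal sketch of that idea in portable awk. The function name `count_samples` is hypothetical, and it assumes a header row, the gene name in column 2 and one column per sample from column 3 onward. It also assumes that a sample meeting both criteria should count once under Either, which the example in the question does not settle.

```shell
# Hypothetical helper: count_samples FILE
# Assumes: header row, gene in column 2, one sample per column from column 3.
count_samples() {
    awk '
    NR == 1 { next }                         # skip the header row
    {
        # Remember genes in the order they first appear.
        if (!($2 in seen)) { seen[$2] = 1; order[++ngenes] = $2 }
        if (NF > ncols) ncols = NF
        for (i = 3; i <= NF; i++) {
            if ($i == "AB") het[$2, i]++     # heterozygous calls per gene and sample
            if ($i == "BB") hom[$2, i] = 1   # at least one homozygous call
        }
    }
    END {
        print "Gene", "Two_variants", "Homozygous", "Either"
        for (g = 1; g <= ngenes; g++) {
            gene = order[g]
            two = homz = either = 0
            for (i = 3; i <= ncols; i++) {
                t = (het[gene, i] >= 2)      # 1 if this sample has two or more AB calls
                h = ((gene, i) in hom)       # 1 if this sample has a BB call
                two += t; homz += h; either += (t || h)
            }
            print gene, two, homz, either
        }
    }' "$1"
}
```

Calling `count_samples input.txt` on the table from the question should print one space-separated line per gene, in input order. Pipe the result through `column -t` if aligned columns are preferred.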
Answer 1: (score: 0)
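use warnings;
use strict;

my (@header, %data);
# $! holds the system error message for a failed open ($? is the child exit status).
open(my $file, "<", "input") or die("Cannot open input: $!");

while (<$file>) {
    # The first line is the header; store it and move on.
    @header = split, next if not @header;
    my @v = split;
    # Tally each genotype (AA, AB, BB) per gene, pooled over all samples.
    $data{$v[1]}->{$_}++ for (@v[2..$#v]);
}
close $file;

print "Gene Two_variants Homozygous Either\n";
for my $k (keys %data) {
    # Halve the pooled AB count; // (Perl 5.10+) defaults a missing genotype to 0.
    my ($var2, $homoz) = (int(($data{$k}{AB} // 0) / 2), $data{$k}{BB} // 0);
    my $sum = $var2 + $homoz;
    printf("%4s %8d %9d %8d\n", $k, $var2, $homoz, $sum) if $sum;
}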